Hey SJ --
Nice post! I wanted to offer some thoughts as you scope out this research direction, in case they are useful.
What notion of natural kind are you employing? I default back to a Quinean view when otherwise not specified, in which case it seems totally legitimate to say that agency is a useful concept within our scientific framework and constitutes a natural kind as much as various other concepts in biology. You might be holding the concept to a higher standard, but the post doesn't clarify what that standard is, and so it's unclear whether your argument that rules out agency would prove too much.
The focus on VNM-coherence seems misplaced. As other commenters have pointed out, there are many concepts of agency besides utility maximisation (e.g., Active Inference, accounts which abandon the fourth postulate, coalitional agency, etc.). Personally I would consider VNM-coherence to be non-central to the concept of agency -- I have given talks about agency before where I don't mention VNM at all and have no trouble conveying the central points. VNM-coherence seems more relevant to issues of decision-theory and rationality than agency. So critiquing the VNM framework seems to me to miss the point of discussing / analysing / using agency as a concept.
If you wanted to improve the argumentation here, I think concretely it would benefit from:
- Better clarity about which notion of natural kind you are talking about
- An argument for why that notion matters / is something we should care about (and ideally a comparison with concepts within AIS which you do think constitute natural kinds, against which you can compare agency)
- Disentangling whether you are talking about agency or about rationality (or, more generally, specifying the concept of agency which you don't think is a natural kind)
Happy to talk more if it would be useful!
Thanks so much Edward, I really appreciate these points!
My understanding of a natural kind is a category that reflects how the world is rather than being constructed in service of our particular interests. To me, 'x is a natural kind' should mean something like 'there is a clear boundary that separates things that are x from things that are not x' or 'x-ness is defined in terms of a clear property that all things possess to a particular extent'. I don't think agency, or being an agent, is really like that. However, I also think that the concept of being a natural kind is probably not, itself, a natural kind, in the sense that what kinds of boundaries or properties we take to be clear is itself something that is determined as much by our interests as by the fundamental nature of things, if not more. So I certainly don't mean that not being a natural kind is a bad thing. I just think that when we treat agency as being more of a natural kind than it is, this can lead us to the kind of teleological thinking about the ultimate nature of 'agency' that I am concerned about here.
My actual inspiration for this argument did not come from ontology though, and I am not sure that talking about it in terms of natural kinds and constructs was the most helpful way of presenting things; it was just the best I could come up with. What I was really thinking about was Derek Parfit's reductionist account of personal identity. On this view, a person's identity is not a fundamental fact about the world; there is actually no clear boundary around the self that would work the way we expect it to. Rather, questions about personal identity are really questions about other facts, what Parfit called 'Relation R', and my identity over time is really just a useful description of these facts that makes sense in many practical circumstances but can break down in others. I think agency is like this. When we want to explore a system's agency, what we really want to do is to understand a bunch of other things about how the system makes decisions and acts on them. Agency is a useful description that summarises these facts in many practical cases (such as human-to-human interactions), but if we then take that same description and apply it to other kinds of system, like AIs, it could lead us astray.
I take your point that my focus on VNM-style coherence arguments may be misplaced here and that there are other, more important, arguments I should be considering. I do think that in the general discourse around AI Safety there is an assumption that agents will inevitably be led to VNM utilities, but I think you have a lot more direct experience of the discourse there than I do. On the other hand, I do feel like there could be some terminological disagreements here. I agree with you that these are really claims about rationality rather than agency, but I think that there is then often a hidden assumption that agents will inevitably seek to become rational. I would love to talk with you more about this!
In traditional AI, Russell & Norvig define "agent" broadly. Other senses of "agent" and "agency" (ethical, financial, psychological, CFAR-handbook, etc.) do not seem to point at a unified concept, but at a wide range of traits and descriptions. Agents₁ can be agents₂ without being agents₃.
Thanks Karl, that's a really useful shortform and I'm sorry I wasn't already aware of it. This is exactly the kind of thing I had in mind when I said "there are many agentic lenses people have constructed". I am planning a follow up point about this but for now I would like to link to your shortform there to highlight what I mean.
I do think there is a general sense of the terms agent and agency, referring to goal-directedness alongside something like coherence seeking or strategic planning, that gets bandied about in AI safety, and that is what I am trying to refer to here. However, I absolutely agree with you that if we were more specific about our use of the term, and about the specific context within which each use makes sense, the kind of teleological thinking that I am arguing against here would likely not emerge.
I'd argue that evolution (and indeed RL) makes agency a rather natural concept (if not a "natural kind" — which I gather you are using in the Philosophy of Science sense: something like "a concept that carves nature at the joints"), especially for any organism that's motile and has senses and manipulation abilities.
However, I would completely agree that the widespread simple model of agency as something like a blank-slate Homo economicus with a utility function is a drastic over-idealization, and that being a loose hierarchical bundle of evolved/learned heuristics is the natural state for an organism/agent (e.g. Shard Theory). If you try to express the actual behavior of a human, or any biological organism, as a utility function – the evolved-and-learned heuristic approximation to their relative inclusive evolutionary fitness that their behavior embodies – you end up with something extremely complex (as in with a Kolmogorov complexity at least comparable to their genome size), and that is, in places around the edges, either Dutch-bookable or lacking clear preference orders between some alternatives, so not actually a utility function.
I'd also add that when we distill agency from humanity into base-model LLM personas, we end up with an LLM psychology that looks like a pretty good copy of that state. Since that's the raw material for Alignment, having a detailed understanding of it seems very important.
Thanks Roger, I agree with you, although I also think we should be careful not to overstate the optimising power of either evolution or reinforcement learning. These are optimising processes to be sure, but evolution certainly isn't very good at finding absolute optima and regularly seems to allow highly suboptimal solutions to problems as 'good enough'. RL is probably a lot better but still imperfect. So I agree that some degree of, say, goal-directedness is likely to emerge from these processes, but it may remain imperfect even if they were left to run for a very long time, especially if any of the goal, the system's basic substrate, or its environment were sufficiently complex to begin with.
Interesting post! I think I mostly resonate with the core claim here as I understand it, namely that reasoning about utility maximising AIs might not be super useful for AI safety, especially in the near term. However, I do think that a roughly agentic/goal-driven-entities lens is very useful for thinking about how intelligent autonomous systems might go bad. Maybe it's worth separating out broader agentic-like-things vs the more specific VNM-style utility maximisers. I think there's a risk here of criticising the latter and using this as evidence against reasoning in a style that still leverages the former at all.
I'd be interested in your thoughts on Shard Theory, which Roger also mentions, and additionally Developmental Cognitive Interpretability, my own research stance having tried to internalise a broader view of agency that leans less on the VNM-style of agency.
That said, it does seem like there are some strong arguments for concerning oneself with narrower views of agency that restrict to goal-directed systems. I think Vanessa Kosoy's comment on one of the articles you link is quite good, and I also quite like this post. I'm interested in your thoughts on these too.
Thanks Jason, I really appreciate these thoughts.
My immediate reaction to your first paragraph is that I think I might have a slightly different way of thinking about this to you, and potentially to most people in this space. I generally agree with you that "a roughly agentic/goal-driven-entities lens is very useful for thinking about how intelligent autonomous systems might go bad" - however the question is not merely how things might go bad but how we can make them go well. Leo Szilard once told workers on the Manhattan Project that they should treat enriched uranium like "a mule that is trying to kick you" because otherwise they tended to be overly complacent about the risk of working with it. This clearly wasn't directly justified as an ascription of agency to the uranium, but it was still a useful safety measure because imagining that the uranium wanted to form critical masses that would explode was a helpful way of thinking about how things might go bad! However, one difference in this case was that it made no difference to the uranium how we conceptualised its decision making, nor did it impact on our scientific understanding of the uranium to think about it this way. Neither of these things is true for AI. So I think it's possible that using a goal-directed lens might be very useful for thinking about how intelligent autonomous systems might go bad, but if we then go on to use that lens in modelling their future trajectories, this could restrict the possibility space we are willing to consider, and if we talk about these systems as goal-directed in ways that get into their training data, or otherwise adjust our training methods around this assumption, then this could become something of a self-fulfilling prophecy. That is what I was trying to say at the end of the article, but I think your comment has prompted me to think about it in a way that I hope makes this point sharper!
I think shard theory is a really interesting hypothesis and model for human value learning. I didn't know about the research agenda you and Edward were developing but it also looks super interesting and I would love to discuss it with you! One of the things that I like about shard theory but don't currently see in your agenda is that it tries to account for meta-agency, the way in which agents start trying to develop sub-agents within themselves to help them manage their goals. Within shard theory this is handled by a bidding mechanism where different shards compete to determine the correct response to a stimulus. I tend to think of this meta-agency as potentially a fully formed agent in its own right, with its own boundaries and belief-like and desire-like states, like the concepts of 'wise mind' or 'loving awareness' I keep coming across in meditation and therapy. One of the things I find so fascinating about human meta-agency is that it seems extremely flexible: some people use their meta-agency to ruthlessly optimise goals towards some single purpose, some to develop balance, equanimity, or internal diversity for their own sakes, and some to realise the impermanence and futility of their goals and desires. I am interested in how we can develop theories of meta-agency that don't dismiss this kind of flexibility out of hand but can explain why meta-agency might develop in different ways even across people/systems that share similar basic architectures and environmental contexts! In your experiments on retraining agents, did you come across anything that seemed to be governing how the agents evaluated, compared, or selected between their original and new goals in a way analogous to this kind of meta-agency?
My reaction to Vanessa Kosoy's comment is that I basically agree with her, and I still think the kind of reflection I am making here is useful. She gives three reasons why, in AI safety, we are from the get-go interested in goal-directed systems. One of these, "we are worried about systems with bad goals", I think I have already dealt with: yes, we should worry about that, but we shouldn't then jump to the conclusion that this tells us what these systems will actually be like. The other two, "we want AIs to achieve goals for us" and "stopping systems with bad goals is also a goal", are very interesting to me because they turn the question back on us. I'm not saying her argument is circular here, but there is a sense in which the claim is that "we are interested in goal-directed systems because we are directing systems towards goals". Well, is that the only thing we could be doing? I think that when we think about aligning people or institutions we often don't take this approach; we endorse vague mission statements, complex decision-making processes with checks and balances, vague but useful social norms and the like. Of course her next comment is "so what is your proposal?" I don't have one yet and that is a clear weakness. However, I don't think that this means no other proposals exist. My current interest is precisely in following this move of turning the question back on ourselves, and understanding alignment not as a purely technical process of giving the AI a well-specified objective, but as a sociotechnical one of developing human-AI interactions that are long-term sustainable and beneficial to us. That is only possible if we are willing to reflect deeply on what we are doing and why, and not just on what superintelligence might do and why.
Finally, on Veedrac's post. I am slightly less persuaded by this. My main reason is just that I think pure optimisation is actually a very rare process to find at the mesa scale. Obviously it happens at the level of fundamental laws of physics, but when larger systems try to pursue this kind of strategy they tend to collapse for one reason or another. Even RL is most effective when it is not pure optimisation but a stochastic process. I do think that if we developed systems that were strong optimisers then it is that, and not their agency per se, that might doom us. However, I don't think we should do that and I don't think we have to either. Maybe that's not such a good response though?
Epistemic status: trying to articulate a big idea which I feel is important but underexplored, partly because it is hard to frame clearly - may not be framing it clearly yet!
Agency, both natural and artificial, is a very important concept. Understanding agency allows us to model our own behaviour and that of others, and it is thus one of the most predictively useful concepts we have at our disposal. In the ordinary, folk-psychological sense, agents are ‘like us’ in important behavioural respects, more or less, meaning we can use thoughts like ‘what would I do if I were them’ to good effect.
However, that does not mean agency is a natural kind. The truth is that we are not the people we imagine ourselves to be, and neither are the humans, animals, complex systems, or even inanimate objects we are prone to thinking of as fellow agents. We are, in fact, nothing but a bunch of hierarchically ordered biological processes in a trench coat. Our behaviour is not neatly determined by our thoughts and ideas, but by a complex mesh of impulses, desires, emotions, and heuristics that are often no less confusing (even, or especially, to the highly intelligent and introspective among us) than those mysterious entities we call other people. Nor are increasingly agentic AIs much of an improvement. While early agents trained directly from reinforcement learning may be conceptually simpler than we are, because their policy function is directly optimized into their weights, systems that simulate agency as an emergent phenomenon from some other process, such as next-token prediction, are just as complex and messy, combining their base model’s stochastic inclinations with the way that their simulated personas move them through semantic space. Agency is a construct that we have developed to help make sense of this mess, but it is only a lens through which we view the world. Indeed, there are many agentic lenses people have constructed (HT to Karl Kruger for pointing me to this useful summary he wrote in the comments), and the kind of lens you use can profoundly influence how you view the world, and yourself.
When engaging in practical work, this sort of claim, that ‘[x] is a construct and the reality is a lot more complicated’, can seem unhelpful. Of course, we all know this, but the point is that agency is a very useful and predictive construct (as are many others, from money and weeds to temperature and species), and we can surely make more progress with it than without it. Obviously, I agree.
The problem is that when we start talking about agents as a natural kind, a fundamentally different type of thing from non-agents in our ontology, we often smuggle a kind of teleology in via the back door. We also assume that our simplified model for how agency works, roughly goal-directed utility maximization, describes what ‘real’ agents do. The fact that all the actually existing agency we see, including our own very imperfect muddling through, isn’t like this only goes to show its imperfection, its pseudo-agency if you will. The alternative I would advocate for is viewing agency as a naturally emergent phenomenon that is built up from other phenomena (such as boundary maintenance, self-modelling, information processing, and so forth) and could continue being built up ad infinitum without necessarily being drawn into such an ideal.
Of course, there are arguments for why this teleology is justified. The best known is that agents whose preferences don't conform to utility maximization can be ‘money-pumped’ (led to pay a cost only to end up where they began) and so dominated by those that do. However, the theoretical basis for such claims is more shaky than is often assumed. These arguments assume preference completeness (that for any two options an agent prefers one or counts them equal) and derive a utility function from it; they never show that agents must have complete preferences in the first place; and an agent can escape the money pump without them. Suppose, with Derek Parfit, that I hold some goods as only roughly comparable: I might prefer being a good writer to a bad one, and a good lawyer to a bad one, yet have no preference between being a good writer and a good lawyer. That wouldn’t necessarily make me exploitable, so long as I spot the money pump game and avoid playing. I need only refuse to trade my current career for any alternative that isn't strictly better (not merely roughly comparable), which breaks the cycle without ever ranking writing against lawyering. One might object that a policy like this just is a utility function under another name, as it still leaves the agent with a set of preferences that is representable as maximizing something. But "representable as maximizing something" is nearly trivial here, since almost any behaviour qualifies. What the threat of domination would actually need to force to justify this teleology is a single cardinal ranking of outcomes, and that is precisely what incomplete preferences withhold.
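To make the writer/lawyer case concrete, here is a minimal sketch in Python of the kind of policy described above. It is only an illustration under stated assumptions: the "decent writer" option, the fee of 1, the particular sequence of offers, and all function names are hypothetical choices made for this sketch, and the "naive" policy is just one simple way an agent might mishandle incomparable options.

```python
# Strict preferences only. Any pair not listed here is incomparable:
# the agent has no preference either way and does not count them equal.
# "decent writer" is a hypothetical option added for this sketch.
STRICTLY_BETTER = {
    ("good writer", "decent writer"),
    ("good writer", "bad writer"),
    ("decent writer", "bad writer"),
    ("good lawyer", "bad lawyer"),
}


def strictly_prefers(a, b):
    """True only if a is strictly preferred to b."""
    return (a, b) in STRICTLY_BETTER


def naive_policy(holding, offered, fee):
    """Pay for strict improvements, and accept any free swap that is not strictly worse."""
    if strictly_prefers(offered, holding):
        return True
    return fee == 0 and not strictly_prefers(holding, offered)


def cautious_policy(holding, offered, fee):
    """Accept only strict improvements (fee size is ignored for simplicity)."""
    return strictly_prefers(offered, holding)


def run_trades(start, offers, policy):
    """Walk through a sequence of (option, fee) offers and return (final option, fees paid)."""
    holding, fees_paid = start, 0
    for offered, fee in offers:
        if policy(holding, offered, fee):
            holding, fees_paid = offered, fees_paid + fee
    return holding, fees_paid


# Two free swaps between incomparable options, then a paid swap back to the start.
PUMP = [("good lawyer", 0), ("decent writer", 0), ("good writer", 1)]

print(run_trades("good writer", PUMP, naive_policy))     # ('good writer', 1)
print(run_trades("good writer", PUMP, cautious_policy))  # ('good writer', 0)
```

The naive agent ends up where it began having paid a fee, while the cautious agent keeps its career and its money. At no point does `strictly_prefers` rank "good writer" against "good lawyer" in either direction, so the cautious agent escapes this particular pump without acquiring complete preferences. The sketch says nothing about whether such a policy has other costs.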
There are also practical reasons why AI safety researchers often wish to defend this view about agency - it plays a central role in some of the most classic and widely respected arguments for why AI is dangerous, such as Bostrom's Superintelligent Will. Indeed, some of the best critiques of AI risk consist largely of questioning these arguments. Yet, these are hardly the only arguments for why superintelligent systems could pose a threat to humanity, and there are more reasons for wanting to explore the fundamental nature of agency than trying to show that AI risk research may be misguided. In any case, it is certainly not my view that alternative views of what agency is will render AI safety trivial or easy!
However, there are reasons why a more thorough and grounded, and less teleological, approach to thinking about the nature of agency could be helpful for developing safer AI. One is that humans' conception of our own agency and that of others influences how we behave, and it is reasonable to assume that the same is true of AI. Consider the following two possible people. One conceives of agency as a false construct tying them to an unsatisfactory life of striving that they are endeavouring to dissolve through rigorous meditation and cultivating love for the inherent worth of all things. The other believes they are Homo economicus incarnate, and the only thing stopping everyone murdering their neighbours for the rings on their fingers is well-designed social incentives. I’m not saying either of these is inherently more aligned or easier to align. However, I also don’t think either is more correct about the nature of agency or more of an agent in how they embody it. What I do think is that, if I were trying to get these people to be nice to me, I would probably go about it quite differently and expect rather different results from them. Of course, the reality for most people is even messier than these toy examples, but our social norms and behaviours are surprisingly well adapted to handle this complexity. I think that is one reason why our everyday moral judgments are often more useful in social alignment than ethical theories.
So, before insisting goal-directed utility maximization is the only form advanced AI could take, I think it is perhaps helpful to make sure we are not obscuring a messy reality of actual AI agency with our, often teleological, assumptions about what it should look like. And perhaps by influencing the kinds of agency AIs go on to develop, we can build another lever to help move us away from the worst of the danger.