It is hard for me to view this epistemic state as pre-paradigmatic.
The claim "we're confused about agency" is not a claim that our theory of agency is pre-paradigmatic. It's a claim that the current paradigm (of expected utility maximization) contains a bunch of anomalies and limitations, suggesting that there's a much better paradigm waiting to be found.
If the goal of agent foundations research is to solve AI safety, it is best to focus on the confusions that seem relevant to that effort.
This is a bad model of how scientific progress works. Instead, people should focus on the clearest anomalies/limitations as "clues" to finding a new paradigm that will change many aspects of how we think about agency.
Analogously, imagine pre-Einstein physicists claiming "we're confused about space and time". People might have made arguments similar to yours in this post in response, saying things like "there are actually many physics equations which use the concepts of space and time to make accurate predictions" or "insofar as we're confused, it is best to focus on the confusions that seem most relevant to the specific problems that face us". But those kinds of arguments would have been mistaken.
The claim "we're confused about agency" is not a claim that our theory of agency is pre-paradigmatic. It's a claim that the current paradigm (of expected utility maximization) contains a bunch of anomalies and limitations, suggesting that there's a much better paradigm waiting to be found.
I think that EU maximization works quite well for ideal agents and I don't expect a very clean theory for bounded agents. Flaws in the EU maximization paradigm do not necessarily suggest that there is a better paradigm waiting to be found, because unlike in physics, I see no a priori reason to expect an elegant grand unified theory for bounded rationality. There can simply be different paradigms for modeling different aspects of agency. Clearly, some general principles and constraints will apply, and some connections will be discoverable. But it is not clear that there is any end to this process, or that we would need to find it to solve AI safety in practice.
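For concreteness, the paradigm under discussion can be sketched in a few lines. This is a toy expected-utility maximizer; the action names, probabilities, and utilities are all made up for illustration:

```python
# Toy expected-utility maximizer: picks the action with the highest
# expected utility under a known outcome distribution.
# All names here (actions, probabilities, utilities) are illustrative.

def expected_utility(action, outcomes):
    # outcomes[action]: list of (probability, utility) pairs
    return sum(p * u for p, u in outcomes[action])

def eu_maximizer(outcomes):
    return max(outcomes, key=lambda a: expected_utility(a, outcomes))

outcomes = {
    "safe":  [(1.0, 1.0)],               # certain, modest payoff: EU = 1.0
    "risky": [(0.5, 3.0), (0.5, -2.0)],  # gamble: EU = 0.5
}
assert eu_maximizer(outcomes) == "safe"
```

The clean theory assumes the agent can enumerate outcomes and compute all the sums; the question in this thread is what happens once that assumption is bounded away.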
This is a bad model of how scientific progress works. Instead, people should focus on the clearest anomalies/limitations as "clues" to finding a new paradigm that will change many aspects of how we think about agency.
Analogously, imagine pre-Einstein physicists claiming "we're confused about space and time". People might have made arguments similar to yours in this post in response, saying things like "there are actually many physics equations which use the concepts of space and time to make accurate predictions" or "insofar as we're confused, it is best to focus on the confusions that seem most relevant to the specific problems that face us". But those kinds of arguments would have been mistaken.
This analogy seems a bit tortured and I don't think the argument works as stated:
Generally, I find this comment fairly unhelpful. I was already aware that you disagreed with the position in this post (along with many others in the agent foundations community, which is why I wrote it), as well as your specific reasons (rejecting Bayesianism / EU maximization). Beyond signaling your disagreement, I don't think your comment engages with my argument, because both of your points seem to implicitly assume that I am wrong, rather than demonstrating that I am wrong.
both of your points seem to implicitly assume that I am wrong, rather than demonstrating that I am wrong.
You were portraying the "we're confused about agency" position as being that agency is "pre-paradigmatic". I think that's a mischaracterization, and corrected it. This doesn't implicitly assume that you're wrong.
However, I accept that my claim "this is a bad model of how scientific progress works" does implicitly rely on the idea that there's a clean new paradigm to be discovered for bounded agents. I didn't have that cached in my head as the core crux between us, but I'd phrase it differently now that I do.
Flaws in the EU maximization paradigm do not necessarily suggest that there is a better paradigm waiting to be found, because unlike in physics, I see no a priori reason to expect an elegant grand unified theory for bounded rationality.
Disagreed. In my experience, flaws in [X] paradigm usually suggest there is a better paradigm [Y] waiting to be found. Correspondence principles wherein the domain of validity of [X] is subsumed by [Y] are not limited to physics. Specifically, a comprehensive theory of bounded agency (say, with bound b) should recover a theory of non-bounded agency in the limit b → ∞, otherwise it's probably incorrect!
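A toy version of such a correspondence (with hypothetical names throughout): an agent that can only evaluate b candidate actions recovers the unbounded maximizer as b grows:

```python
# Toy correspondence principle: a bounded agent evaluates only the
# first b candidate actions; as b approaches the full action set,
# it recovers the unbounded argmax. All names are illustrative.

def utility(action):
    return -(action - 7) ** 2  # utility peaks at action 7

def bounded_choice(actions, b):
    # Resource bound: may only evaluate b candidates.
    considered = actions[:b]
    return max(considered, key=utility)

actions = list(range(100))
# With a small bound the agent can miss the optimum...
assert bounded_choice(actions, 5) == 4
# ...but with the bound removed it recovers the unbounded argmax.
assert bounded_choice(actions, len(actions)) == 7
```

This only illustrates what "recovering the unbounded theory in the limit" means; it says nothing about whether a clean theory of the bounded regime exists, which is the actual disagreement below.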
On the object level, some of Einstein's gedankenexperiments on clock synchronization are pretty hard to distinguish from "a philosopher specializing in metaphysics had invented special relativity by reasoning from first principles about time".
More importantly, AI safety is fundamentally an engineering problem, not a scientific problem.
Disagreed. If it's an engineering problem, then I would expect that we must already deeply understand all the relevant principles -- so that it's just a matter of exchanging cost vs. safety, i.e. how much do we want to overengineer this? That does not characterize my epistemic state on AI safety in the least. In the moon landing analogy, I think AI safety research is arguably about at the level of the Tsiolkovsky rocket equation.
Specifically, a comprehensive theory of bounded agency (say, with bound b) should recover a theory of non-bounded agency in the limit b → ∞, otherwise it's probably incorrect!
This argument does not prove your point. It proves that if there is a clean theory of bounded agency, it should limit to some theory of unbounded agency. It does not prove that there must be a clean theory of bounded agency.
On the object level, some of Einstein's gedankenexperiments on clock synchronization are pretty hard to distinguish from "a philosopher specializing in metaphysics had invented special relativity by reasoning from first principles about time".
Interesting, link? But I think I stand by my point that the analogy is tortured.
Disagreed. If it's an engineering problem, then I would expect that we must already deeply understand all the relevant principles -- so that it's just a matter of exchanging cost vs. safety, i.e. how much do we want to overengineer this? That does not characterize my epistemic state on AI safety in the least. In the moon landing analogy, I think AI safety research is arguably about at the level of the Tsiolkovsky rocket equation.
I think engineering problems can have scientific subproblems, so I do not agree that we need to understand all of the relevant principles for AI safety to be an engineering problem (I guess I should be clear it's not a "mere" engineering problem, in the sense of conceptually straightforward implementation, but how many hard engineering problems are like that in general?). Rather, it is an engineering problem because the ultimate objective is to achieve a desired effect, not to understand something better.
Non sequitur? I did not expect that statement to prove "there is a clean theory of bounded agency." It's a terse illustration of why correspondence principles aren't physics-magic. I am trying to explain why I believe we are disagreeing on priors here. It appeared to me that you believe elegant grand unified theories only appear in physics a priori.
https://www.fourmilab.ch/etexts/einstein/specrel/www/
Namely
I think that we are in the situation where we need to understand something (agency; metaethics) better to achieve the desired effect (safe superintelligence). I think the amount of scientific work to be done on the former far eclipses the engineering work of the latter, to the point that it's disingenuous to call the scientific work a subproblem. So it is more precise to say that I agree with you on the factual matter, but am dismayed by your rhetorical framing.
Non sequitur? I did not expect that statement to prove "there is a clean theory of bounded agency." It's a terse illustration of why correspondence principles aren't physics-magic. I am trying to explain why I believe we are disagreeing on priors here. It appeared to me that you believe elegant grand unified theories only appear in physics a priori.
If what I said was non-sequitur then I (still) do not understand what you are trying to say.
I think that it requires a pretty serious level of scholarship to confidently claim that everyone is confused about agency.
Yes, but, what is one supposed to do? I have investigated several diverse areas in hopes of finding relevant signal, and have largely come up empty. I'm just one guy, but if lots of other people do this, one can become fairly confident that the understanding we'd need just isn't there yet.
One general frustration I have with rationalist culture is the loner/outsider attitude that emphasizes cleverness over scholarship, and generally underestimates the sometimes-slow but cumulative progress of academia.
I fairly strongly agree with this, and argue this to rationalists not too infrequently, though I would also make various counterpoints.
It is an overstatement because there is actually an extensive literature on agency from various perspectives (statistics, decision theory, economics, game theory, artificial intelligence, learning theory, algorithmic probability, etc.) which share many common powerful concepts and principles (such as probability theory).
This is true but it's really not that much evidence about the question at hand. The same could be said of biology BEFORE Darwin's theory, and BEFORE genomics and genomic regulatory networks and so on.
Though it is very difficult for one individual to study each of these areas in enough depth to see the connections clearly (and I don't claim to have achieved this yet), it is possible to do this,
What do you mean? What are you claiming is possible for one person to do?
and the existing knowledge seems very likely sufficient to explicate (most of) the necessary concepts.
Why on earth do you think this?
Also, what do you think are the necessary concepts? See for example this question (which I probably misstated), which I believe is more or less totally unaddressed (because it's hard): https://www.lesswrong.com/posts/NqsNYsyoA2YSbb3py/fundamental-question-what-determines-a-mind-s-effects
At least, it seems very hard to rule out that "agency" is relatively intricate and multi-faceted, rather than "confusing."
It's both.
It seems that "confusion about agency" is like "confusion about (biological) life," and this is not by accident. There are many underlying principles of life (such as Darwinian evolution). There are also many commonalities between (most?) forms of life, such as carrying genetic information encoded as DNA. There are common limitations on what life forms can achieve from thermodynamics. But there are also many diverse forms of life which use an array of distinct biological mechanisms, some more confusing (to us) than others. I think the same is true of agency; it appears in various forms, through various interesting mechanisms, which tend to solve common problems in similar ways, while constrained by certain hard limitations.
Without committing too much, I'll grant this as perhaps the best available analogy; then I'll ask: Wouldn't it be appropriate to say that you lack fundamental concepts needed to understand life, if you don't know about:
When I discussed the idea for this post with Abram Demski, he mentioned you as a notable example of someone who believes we are very confused about agency, so I was glad to see your response. In particular, I think you described to Abram some entirely missing field of study between AI, psychology, and linguistics? Have you written about this anywhere?
Yes, but, what is one supposed to do? I have investigated several diverse areas in hopes of finding relevant signal, and have largely come up empty. I'm just one guy, but if lots of other people do this, one can become fairly confident that the understanding we'd need just isn't there yet.
If you believe you have justified confidence that everyone is confused about agency then you should say so.
I fairly strongly agree with this, and argue this to rationalists not too infrequently, though I would also make various counterpoints.
Such as?
This is true but it's really not that much evidence about the question at hand. The same could be said of biology BEFORE Darwin's theory, and BEFORE genomics and genomic regulatory networks and so on.
This seems like a potential crux.
Biology before Darwin (and his contemporary, ~independent co-inventors) was missing even a partial answer to some central questions, such as where species (and life) come from. Lamarck already had a (false but sort of on-track) explanation for the adaptation of species to their environments. If I applied to biology then the same standard that I am applying to agency now, would I have mistaken Lamarck's explanation for a well-developed theory? I suspect that our understanding of agency is more formal, more complete, and more unified across fields than biology was in 1800, but I don't know enough of the history to say for sure.
Before Lamarck, biology seems to have lacked unifying principles to address its central questions. Lamarck was a contemporary of Darwin, so the period where it is not clear whether "the same could be said" about biology seems to have been rather short.
What do you mean? What are you claiming is possible for one person to do?
Build a unified picture of agency from the perspectives of the fields that formally describe it, which "explains" agency about as well as evolution + genetics "explains" biology. Perhaps a little worse.
Why on earth do you think this?
The answer is a bit hard to compress into a comment, which is partially because I think the state-of-the-art "explanation of agency" does not have a very short compressed length. My meta-theory of rationality sequence might be the best answer I have written, but is not very direct. Another answer is that as I have studied rationality from various angles, I have tended to update in the direction that existing work has done a better job of modeling agency than I expected, and I am trying to get ahead of future updates. Also, I have a bundle of intuitions and models from theoretical computer science that suggest there should not be elegant complete theories for bounded rationality.
Also, what do you think are the necessary concepts? See for example this question (which I probably misstated), which I believe is more or less totally unaddressed (because it's hard): https://www.lesswrong.com/posts/NqsNYsyoA2YSbb3py/fundamental-question-what-determines-a-mind-s-effects
Initially, your post asks for a theory of agency, but later it shifts towards asking for a theory of alignment (or control/obedience).
"What determines a mind's effects" does not seem likely to have a clean answer in general. What determines the trajectory of embryonic development? What determines the trajectory of an ecosystem? In your words: "It may be that there is no answer. It may be that no small assembly of small elements of the mind comprehensively determines the mind's effects."
I am not sure that there is a finite set of concepts for understanding agency. The problem with a theory of agents is that agents invent theory, so agent theory keeps pulling more stuff into itself. I guess that the point of agent foundations is to focus on only the core theory of agency. But what characterizes the core theory? If Bayesian probability is part of it, why stop at the definition of updating and not develop the theory of exponential families (without end)? I think the usual unstated frame is that the core is the part necessary to build an agent that recursively self-improves. Unfortunately, I suspect it is likely possible to do this with very little theory (e.g. the current trajectory of ML). That seems unsatisfying, so perhaps one hopes for a theoretically justified self-improving agent. I think we have reasonably good concepts for discussing this (the recursion theorem, Levin/Hutter search, logical induction, and various other approaches to reflection in decision theory, maybe IB). Unfortunately, self-improving systems from theory tend to be impractical relative to hackier ML approaches (but practice is usually hackier than theory, so this is not really surprising or confusing).
However, building more practical theoretically-justified agents is not clearly sufficient for solving AI safety! Even theoretically justified agents tend to become opaque as they run (e.g. AIXI approximations). But that problem is about communication, not agency in general. In other words, the thing that we really want for AI safety is not centrally deconfusion about agency. It's more like deconfusion about a specific agent :)
But you say
Without first understanding how the effects of minds in general are determined, designing a mind to have specifiable effects is out of order. If the designer understands how to design the mind so that its effects are specifiable, then the designer also understands how the specification channel determines the mind's effects.
I don't agree with this. It is often easier to design some X that works predictably, than to verify whether an arbitrary X works. For example, we can write verifiable programs, but we can't verify arbitrary programs.
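A minimal sketch of this asymmetry, using hypothetical combinators: programs built only from bounded constructions terminate by construction, while deciding termination for arbitrary code is the halting problem.

```python
# Designing for a property vs. verifying it after the fact.
# Every function built from these combinators terminates by
# construction: they only compose functions and iterate a fixed
# number of times. (Names are illustrative.)

def compose(f, g):
    return lambda x: f(g(x))

def repeat(f, n):
    # Bounded iteration: always terminates for n >= 0.
    def run(x):
        for _ in range(n):
            x = f(x)
        return x
    return run

inc = lambda x: x + 1
add5 = repeat(inc, 5)
assert compose(add5, add5)(0) == 10

# By contrast, deciding whether *arbitrary* code terminates is the
# halting problem, and Rice's theorem extends the impossibility to
# any non-trivial semantic property -- no general verifier exists.
```

So "design X to be predictable" and "verify an arbitrary X" really are different problems, with the former often tractable where the latter is impossible in general.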
If your claim is that we do not have the necessary concepts to perform AI alignment, then I agree. We are confused about AI alignment. This statement is narrower and more useful than "We are confused about agency."
Before Lamarck, biology seems to have lacked unifying principles to address its central questions.
Before Linnaeus it lacked unifying principles.... But after Linnaeus, we have categorized everything.
...By which I mean to suggest, I think we might be miscommunicating about the questions at hand... If a post-Linnaean biologist is looking back and saying "sure, we had lots of information about species, but we didn't have a conceptual scheme; now we do, so we are no longer confused about biology", then the problem isn't that ze is wrong that a big advance happened, but rather that ze is... failing to imagine that there could be major future insights, after which the prior state of knowledge would seem fundamentally conceptually impoverished.
I think that a biologist in 1900 is confused about several fundamental things about life. For example, they don't know about GRNs. My guess would be that biologists in 2026 are also confused that way. I wouldn't know specifically what they are confused about, but for example they may lack concepts of [metastable self-reinforcing gene regulatory states] or [bundle structures within the space of biological functions corresponding to characters such as cross-species-hands and cross-species-eyes and cross-species-T-cells]. (You would definitely find papers about both of those topics, and probably lots of other interesting theoretical topics in biology, but I'm saying it would be unsatisfactory, where it could eventually be satisfactory.)
I don't agree with this. It is often easier to design some X that works predictably, than to verify whether an arbitrary X works. For example, we can write verifiable programs, but we can't verify arbitrary programs.
Your interpretation of this paragraph is reasonable, and I agree that the statement you're hearing is incorrect; but rereading the whole section ( https://www.lesswrong.com/posts/NqsNYsyoA2YSbb3py/fundamental-question-what-determines-a-mind-s-effects#The_word__a_), I largely stand by what I wrote--that paragraph functions in context as part of a development of thought that is dealing with the tension between (1) we just need to design one mind that works, and (2) you can't really do that without understanding minds somewhat more generally (IDK how much).
This statement is narrower and more useful than "We are confused about agency."
I think it loses a pretty important + useful idea, which is that it's NOT the case that:
we understand agents and [their goals and how they relate to their goals or something], as long as you're not trying to change [their goals and how they relate to their goals or something].
"What determines a mind's effects" does not seem likely to have a clean answer in general.
I can quibble with this, and of course it's an open question (e.g. cf. https://www.lesswrong.com/posts/NvwjExA7FcPDoo3L7/are-there-cognitive-realms). But I certainly take your point that you maybe don't need to understand all or most minds. (I may have written something contradicting that in the "fundamental question" post; if I did, then I was mistaken or wrote unclearly or something.)
I am not sure that there is a finite set of concepts for understanding agency. The problem with a theory of agents is that agents invent theory, so agent theory keeps pulling more stuff into itself.
I think I agree with this. Cf. https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought
I guess that the point of agent foundations is to focus on only the core theory of agency. But what characterizes the core theory?
I would say that alignment is pretty close to the core theory; let's say, 1/4 of core theory is alignment, and 1/2 of alignment is core theory (numbers made up, but just to give relative sizes or something). Or IDK haha. But what I mean is, core theory would tend to be reflectively stable:
If something tends to get self-modified away, it's probably not core.
You might say that nothing is stable, everything gets self-modified away? We could debate that. But I have a perhaps not very accountable sense that it is possible for me to decide some things about myself permanently, even though I'm fairly gung-ho (in the long run) about radically changing / growing / transcending myself. E.g. I think I can decide to never pointlessly torture a person--I mean presumably someone could fairly feasibly mess with my head a bunch to change that property, but I mean that left to my own RSI devices I would never change my mind on that. Do you agree that some things can be stable like that?
"What determines a mind's effects" does not seem likely to have a clean answer in general. What determines the trajectory of embryonic development? What determines the trajectory of an ecosystem? In your words: "It may be that there is no answer. It may be that no small assembly of small elements of the mind comprehensively determines the mind's effects."
Yeah, not in full generality. IDK how generally it should have an answer. I think embryonic development, and especially an ecosystem, are bad comparisons because they aren't sufficiently mind/agent-like. The embryo is executing a bunch of narrow adaptations (various homeostats, and similar, I imagine); it isn't on a very determined trajectory, and it can be fairly easily disrupted, and it won't pull itself back onto the trajectory. Even more so for the ecosystem. A mind/agent, on the other hand, can aim itself -- at least to some pretty large degree.
Also, I have a bundle of intuitions and models from theoretical computer science that suggest there should not be elegant complete theories for bounded rationality.
I agree with this for at least two reasons, one of them being "The concrete is never lost" (https://tsvibt.blogspot.com/2025/11/ah-motiva-2-relating-values-and-novelty.html#values-require-reference). But I don't view this as bearing much on questions like "can we get a major compression of the important elements of the domain". All complex, context-embedded things will have a bunch of complication that doesn't reduce, but that doesn't mean you can't much more deeply understand things. E.g. the OS of a laptop computer ~irreducibly must have complexity to interface with mouse, keyboard, screen, wifi, audio, different memory elements, etc.; but it's still very principled (written in multiple layers of languages, using algorithmic ideas for high leverage) and is much more understandable than a huge pile of machine code that some compiler spits out (or however that works).
Build a unified picture of agency from the perspectives of the fields that formally describe it, which "explains" agency about as well as evolution + genetics "explains" biology. Perhaps a little worse.
Oh. Well that's kind of a low bar. Maybe we don't disagree about this then, not sure. We agree that it's not nearly enough for alignment, right?
Right, but I think it's not enough in the sense that we need to develop the specific concepts which are relevant to alignment.
Maybe, but I still think it's a strategic mistake to aim at the center of "what a mind is" when you want to hit the center of (for example) "what trust is."
Because of the question of reflective (in)stability in general, I think it's quite hard to get a handle on anything really important in a mind other than by really understanding mind/agency. Otherwise you have no coordinates for what the mind "really is" in the sense of what elements of the mind will actually stick around.
In particular, I think you described to Abram some entirely missing field of study between AI, psychology, and linguistics? Have you written about this anywhere?
Not really (as I moved out of AGI alignment research); and if I did, it would be a speculative gestural description, rather than examples, because I don't actually know the theory. A distant spiritual cousin might be Eisenstat's Condensation (https://www.lesswrong.com/posts/BstHXPgQyfeNnLjjp/condensation). The closest thing I can point you to is:
But that's only obliquely related IIRC.
If you believe you have justified confidence that everyone is confused about agency then you should say so.
Pretty confident, yeah.
I fairly strongly agree with this, and argue this to rationalists not too infrequently, though I would also make various counterpoints.
Such as?
The main thing would be that it can be highly beneficial to work things out for yourself, even if it's much slower in the obvious sense that you learn the set of known ideas more slowly. Other points would be that the literature has indeed sometimes missed important questions or ideas; or, even if someone did discuss a topic, it can be too hard to find because the literature didn't signal-boost it properly; and academics are often uncomfortable with overly speculative topics when they could just as well have been excited about them. There's probably more, IDK. But generally I agree that rationalists tend to be dismissive of academia (e.g. failing to update that if academia didn't do X, that's often fairly strong evidence that X is less interesting and/or more difficult than you think it is).
I agree with the counterarguments people have commented here.
I nonetheless have also felt something is off about how people say "we are confused about agency" or similar. It does often feel like a bit of a cultural signal. And sometimes it feels... idk, some flavorcombo of "giving up" and "committed to a particular frame."
I think it maybe feels particularly off when particular people say it a particular way, and I don't feel up for picking the fight about that right now, but, wanted to flag that it feels like there is at least something to this post, despite me agreeing with much of Richard and Tsvi's takes.
I resonate with the vibe of this comment, despite recognizing that we probably have some object-level disagreements.
Note that I only put 65% confidence on the phrase "we are confused about agency" actually being "wrong" in the sense of incorrectly describing our epistemic state.
When I say I am confused about agency, I mean that I am confused about how to design an agent that has quantifiable, useful properties. A nuclear engineer is not confused about nuclear physics and therefore is able to design a power plant for which they know exactly the energy output, emitted radiation, etc., and can predict potential failure modes and prepare for them. I want to have that kind of clarity.
I have no objection to that, but saying you are confused about agency does not seem like a precise way of communicating it.
Can you take the physics of our world and rule out, on theoretical grounds, that physics is an agent which optimizes for trajectories with the least action?
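For reference, the principle being alluded to: classical trajectories extremize the action,

$$S[q] = \int_{t_1}^{t_2} L(q, \dot{q}, t)\, dt, \qquad \delta S = 0,$$

so "optimizing for least action" is the standard variational formulation of classical mechanics (strictly, physical trajectories are stationary points of $S$, not always minima).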
People are confused about agency because they are confused about agents, that is, what it means to be a separate entity that can act. The trouble is that we see no evidence for metaphysical or even physical separation. Instead, separation is ontological, a product of how we frame our observations of the world. And so if you go looking for an objective, ontology-independent understanding of agents and agency, you get confused because you notice the separation is in the map and not inherent to the territory.
I think that's why people say it. They are genuinely confused.
This proves too much. If separation were purely ontological, it would imply that there are no differences. But there are clearly differences. Separations being partial, continuous, (spatial or temporal) frequency-specific, etc., doesn't mean there's no separation, or that it's just a modeling choice. There's an objective gap in the bond strengths of the molecules in a chair-floor system that makes the chairfloor naturally factor into chair and floor. If one could view the histogram of bond strengths, one might be able to find others. And there might be some bonds formed between chair and floor, maybe a few strong bonds. But there's an objective difference; a theory which insists there are no qualitative separations misses that, quantitatively, there are. Dismissing separability of stuff specifically for agents seems like a mistake to me, though I do agree there are many appealing-but-false ways to describe the separability of agentic stuff.
We're both chunks of matter in reality - it's arguable exactly how much of our environment to include in the self-chunk, but we're definitely still separate chunks, even in ontologies with continuous-valued identity - and we're within at most 60 light-milliseconds of each other, but the information flow within our brains is vastly higher than the info flow through these comments. It doesn't matter what ontology you use to describe that - as long as you describe it in enough detail, your description must end up including the gap in connection strengths of various kinds.
Maybe your description doesn't call out the largest gap, but I claim there's an objective sense in which Indra's net is not fully connected, even though every connection exists, in a similar sense to how a fully connected weighted graph can still have near-disconnected clusters once you look at the weights.
But there are clearly differences.
To be a bit pedantic (and rhetorical), how do you know? You have some model of the world. When you say there are clearly differences, you are saying this from the standpoint of your world model. In general this is not something I'd need to call out, but here it matters, because the very idea of separation is inherent to the very idea of modeling. If you don't model, then there's no separation, because there's nothing to do the sorting into categories. You may hold a contingent belief in your model that, in the counterfactual where you and no one else did any modeling there would still be separation, but this is impossible to consider, because to consider it itself requires modeling the world.
(Tangential, but to be clear, I would consider "modeling the world" a fairly basic act that happens any time information is created by a control system. The human version of this is just dramatically more complex than the 1-bit model a thermostat has.)
None of this is to say that modeling the world as if separation exists prior to ontology is not useful. In fact, it will lead to more accurate predictions, and you should do it! The trouble is that this doesn't let us jump to assuming that separation exists independent of ontology, because we have no non-ontological way to observe it. This leaves assertions about the existence (or nonexistence) of separation beyond ontology as metaphysical claims about which we cannot know.
This is why I argue that the idea of a separate agent is confusing. Perhaps I was a bit too strong in saying that we see "no evidence for [...] physical separation"; I see how, without the above argument laid out, this could be read as an argument against the value of physical modeling, which was not my intent.
I don't feel particularly confused about what it means to be an agent. I certainly have no problem pointing out agents in the world. The category has some edge cases, like all other categories. I don't really expect a very precise definition of agents to be either possible or necessary for AI safety. The line of argument through "what is an agent?" is, to me, one of the least persuasive cases for working on agent foundations.
I am more confused about how agents work, but that is a different issue.
How do you define an agent?
Which of the following entities are agents? If they are agents, what kind?
2. a rocket actively steered by a Kalman filter
3. the membrane of an E. coli
4. a nematode worm
5. an ant swarm
6. a dog
7. a human
8. a human that is alert and focused
9. Amazon, the company
10. a humanoid robot
11. a robot arm
12. the United Kingdom
13. Claude 4.2
14. Claude 4.2 being called by an "agentic workflow" app
15. the process of evolution by natural selection
17. all of humanity over the last thousand years
18. AIXI
19. a Solomonoff inductor
20. a logical inductor
21. a Szilard engine, i.e. a Maxwell's demon
22. etc.
Agents form an empirical cluster, sharing certain properties such as learning and planning to steer toward narrow targets. I don't think there is a natural sharp definition. I'd prefer to discuss the agent-like (safety-relevant) properties of particular algorithms and instantiations rather than categorize edge cases.
But for fun,
I agree that we have strong intuitions about agents and agency. The problem is that those intuitions are difficult to make rigorous. The attempt to make them rigorous is the path down which confusion lies. If you forgo formal rigor, there need be no confusion, but in the domain of AI safety one gives up much by doing so.
That would be true for any AI safety plan that relies on a precise definition of agency. I agree that some safety approaches (like direct cosmopolitan value alignment) would probably need this, but I mostly don’t like those plans and think of agents as an empirical cluster.
I think you're right that "confused" is the wrong word, but this comes from one specific mistake: attributing agentic qualities to systems that don't actually exhibit these qualities.
Agency implies self-sufficiency: an agent persists, maintains itself, sets its own goals, and acts without coercion. Thermostats act independently, but we don't call them agents; they don't set their own goals. They respond when prompted, within an infrastructure that humans built, maintain, and direct.
When the discourse treats these systems as if they're approaching genuine agency, it smuggles in assumptions about autonomy, self-sufficiency, and the need for corrigibility. But those assumptions may not apply to something that was never self-sufficient in the first place. You don't make a hammer corrigible.
We don't lack a theory of agency; rather, we're pattern-matching systems into an agent frame whose criteria they don't meet. Your post ends with the quote "The overall discourse should be improved." I agree. That's where we should start, but not end. If we stop projecting our own qualities onto anything that generates text, a lot of the confusion you're describing would dissolve.
Epistemic status: Speculation on agent foundations research culture (with which I am pretty deeply engaged) and on whether "we are confused about agency," which I am not sure about. I will take for granted that this is a common refrain, which should be familiar to anyone who is part of the relevant scene.
The phrase "We are confused about agency," often with intensifiers such as "way too," "deeply," or "dangerously," is a common membership signal for a certain AI safety research culture. Roughly speaking, this is the culture of the agent foundations research program that accreted around MIRI.[1] The phrase is usually supplied as an argument for delaying the development of AI until certain mathematical research (particularly in learning/decision theory) has been carried out.
I find the phrase uncomfortable on various levels.
As Cultural Signal
Since I claim that the phrase (apart from its literal meaning) functions as an in-group signal, it is natural to wonder what "we" refers to here.
I believe the intention is "we = everyone." I think that it requires a pretty serious level of scholarship to confidently claim that everyone is confused about agency. One general frustration I have with rationalist culture is the loner/outsider attitude that emphasizes cleverness over scholarship, and generally underestimates the sometimes-slow but cumulative progress of academia.[2] I will discuss in later sections whether the phrase is true under the (typically intended?) "we = everyone" interpretation, which is treated as the default outside of this section.
Another natural interpretation is "we = humanity," which is substantially different because humanity is frequently confused about something that many individuals (and often whole professions) understand clearly. This would suggest drastically different interventions (e.g. education, communication). Interestingly, MIRI is now focused on such interventions. Arguably, MIRI is sufficiently less confused about agency than humanity to at least recognize that no one is prepared to build artificial superintelligence (ASI) safely. I think that this meaning usually is not intended, because the phrase is usually offered to support the relevance of some particular research agenda, not a communication effort.
Alternatively, "we" could mean the agent foundations community, or just "me," which would be a narrower but not necessarily more correct claim. Ironically, I think that something like the opposite of this interpretation of the phrase is often load-bearing: "The frontier labs, as opposed to us, are too confused about agency to be trusted with ASI."
Confused in what Sense?
The word choice "confused" is part of the in-group signal, recalling Noticing Confusion.
Is it really appropriate here? I feel that it does not quite fit.
Confusion goes beyond alternative word choices such as "uncertainty" or "ignorance." It suggests that we do not even have the right concepts to begin talking about agency (which is roughly what John Wentworth argues here).
I think this is not only an overstatement, but probably (at ~65% confidence) just an incorrect description of our epistemic state.
It is an overstatement because there is actually an extensive literature on agency from various perspectives (statistics, decision theory, economics, game theory, artificial intelligence, learning theory, algorithmic probability, etc.) which share many common powerful concepts and principles (such as probability theory). Though it is very difficult for one individual to study each of these areas in enough depth to see the connections clearly (and I don't claim to have achieved this yet), it is possible to do this, and the existing knowledge seems very likely sufficient to explicate (most of) the necessary concepts. It is hard for me to view this epistemic state as pre-paradigmatic. At least, it seems very hard to rule out that "agency" is relatively intricate and multi-faceted, rather than "confusing."
My stronger claim is that confusion is the wrong description not only in degree but in kind. It seems that "confusion about agency" is like "confusion about (biological) life," and this is not by accident. There are many underlying principles of life (such as Darwinian evolution). There are also many commonalities between (most?) forms of life, such as carrying genetic information encoded as DNA. There are common limitations on what life forms can achieve from thermodynamics. But there are also many diverse forms of life which use an array of distinct biological mechanisms, some more confusing (to us) than others. I think the same is true of agency; it appears in various forms, through various interesting mechanisms, which tend to solve common problems in similar ways, while constrained by certain hard limitations. For more details, see my "meta-theory of rationality" sequence.
Confused about what?
While I would not say that we are confused about agency, I think we have a lot of specific confusions about certain aspects of agency. It is best to narrowly characterize which ones we are talking about. If the goal of agent foundations research is to solve AI safety, it is best to focus on the confusions that seem relevant to that effort.[3]
For example, a theory of communication and concepts may aid value alignment and interpretability. It is easier to design a powerful but opaque learning algorithm than it is to design a powerful learning algorithm with a knowledge base we can actually read. I think we are still confused about exactly what that would even mean, and even whether it is possible (and in what sense).
From a decision-theoretic standpoint, corrigibility and other forms of unambitiousness also seem difficult and maybe even confusing, but essential to ASI going well in practice. What does it mean for one system to reliably allow correction by another?
In fact, all of these topics seem to fall under a sort of general category that Abram Demski might call "understanding trust." Roughly speaking, communication is about our trust in an AI system, and corrigibility is about its trust in us. The precise nature of this connection, in itself, also seems confusing (to me).
I think that experienced agent foundations researchers have already internalized these lessons, and the established agendas I am aware of pretty much all carve off some particular confusion about agency which seems relevant to safety.[4] But the overall discourse should be improved.
Gretta Duleba has described it as "The Alignment-Is-Hard Cluster," but I am pretty sure that the phrase "We are confused about agency" is better Bayesian evidence for membership than "Alignment is hard."
Dan Murfet impressed this on me.
Of course following curiosity is a useful local research strategy. But this line of argument frequently justifies cope.
Vanessa Kosoy and Alexander Appel come to mind as exceptions in the sense that they seem to be attacking overall confusion about agency pretty much directly. But they're very good at it, and I think that they (in particular) should keep doing what they are doing. The correct number of researchers taking this path is not zero.