My impression is that a ton of work at MIRI (and some related research lines in other places) went into answering this question, and indeed, no one has a crisp answer right now, and yup, that's alarming.
See John Wentworth's post "Why Agent Foundations? An Overly Abstract Explanation", which discusses the need to find the True Name of agents.
(Also, while I agree agents are "more mysterious than rocks or The Odyssey", I'm actually confused why the circularity is particularly the problem here. Why doesn't the Odyssey also run into the Abstraction for Whom problem?)
I usually think of this in terms of Dennett's concept of the intentional stance, according to which there is no fact of the matter of whether something is an agent or not. But there is a fact of the matter of whether we can usefully predict its behavior by modeling it as if it was an agent with some set of beliefs and goals.
For example, even though the calculations of a chess-playing computer have practically nothing in common with human thought, its moves can still be effectively predicted by assuming that it “wants” to win at chess and “knows” the rules of chess. This gives rise to the prediction that it will always choose, from the list of viable moves, one which best furthers the goal of winning the game. Even though the best move may not be obvious, adopting the intentional stance still allows the human observer to improve on their predictions of what the computer would do, by eliminating obvious bad moves.
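A toy version of this point can be sketched in code. The sketch below swaps chess for a much smaller game (a single pile of stones, remove 1 to 3 per turn, whoever takes the last stone wins), since that keeps it self-contained. Everything here is a hypothetical illustration: `opaque_system` stands in for the computer whose internals we do not inspect, and `intentional_prediction` is an observer who assumes only that the system "wants" to win and "knows" the rules.

```python
from functools import lru_cache

MAX_TAKE = 3  # rule of the toy game: remove 1..3 stones; taking the last stone wins


def legal_moves(pile):
    """All moves the rules allow from this pile size."""
    return list(range(1, min(MAX_TAKE, pile) + 1))


@lru_cache(maxsize=None)
def mover_wins(pile):
    """True if the player to move can force a win from this pile size."""
    return any(not mover_wins(pile - m) for m in legal_moves(pile))


def opaque_system(pile):
    """The system being predicted; the observer never looks inside this function."""
    return pile % 4 or 1


def intentional_prediction(pile):
    """Moves consistent with 'it wants to win and knows the rules'."""
    winning = [m for m in legal_moves(pile) if not mover_wins(pile - m)]
    # When no move wins, the stance cannot narrow the options any further.
    return winning or legal_moves(pile)


def hit_rate(predict, piles):
    """Chance of guessing the system's move when picking uniformly from the predicted set."""
    total = 0.0
    for pile in piles:
        candidates = predict(pile)
        if opaque_system(pile) in candidates:
            total += 1 / len(candidates)
    return total / len(piles)


piles = range(1, 21)
print("intentional stance:", hit_rate(intentional_prediction, piles))
print("no model (any legal move):", hit_rate(legal_moves, piles))
```

The observer's search and the system's one-line rule have nothing in common internally, yet attributing a goal lets the observer rule out the losing moves and predict noticeably better than guessing among all legal moves. In positions where every move loses, the stance offers no extra help, which matches the "improve on their predictions" (rather than "predict perfectly") framing.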
There is no observer-independent “fact of the matter” of whether a system is or is not an “agent”. However, there is an objective fact of the matter about how well a particular system’s behavior is modeled by the intentional stance, from the point of view of a given observer. There are, objectively, patterns in the observable behavior of an intentional system that correspond to what we call “beliefs” and “desires”, and these patterns explain or predict the behavior of the system unusually well (but not perfectly) for how simple they are. [...]
There are several approaches one might take to predicting the future behavior of some system; Dennett compares three: the physical stance, the design stance, and the intentional stance.
In adopting the physical stance towards a system, you utilize an understanding of the laws of physics to predict a system’s behavior from its physical constitution and its physical interactions with its environment. One simple example of a situation where the physical stance is most useful is in predicting the trajectory of a rock sliding down a slope; one would be able to get very precise and accurate predictions with knowledge of the laws of motion, gravitation, friction, etc. In principle (and presuming physicalism), this stance is capable of predicting in full the behavior of everything from quantum mechanical systems to human beings to the entire future of the whole universe.
With the design stance, by contrast, “one ignores the actual (possibly messy) details of the physical constitution of an object, and, on the assumption that it has a certain design, predicts that it will behave as it is designed to behave under various circumstances.” For example, humans almost never consider what their computers are doing on a physical level, unless something has gone wrong; by default, we operate on the level of a user interface, which was designed in order to abstract away messy details that would otherwise hamper our ability to interact with the systems.
Finally, there’s the intentional stance:
Here is how it works: first you decide to treat the object whose behavior is to be predicted as a rational agent; then you figure out what beliefs that agent ought to have, given its place in the world and its purpose. Then you figure out what desires it ought to have, on the same considerations, and finally you predict that this rational agent will act to further its goals in the light of its beliefs. A little practical reasoning from the chosen set of beliefs and desires will in many—but not all—instances yield a decision about what the agent ought to do; that is what you predict the agent will do.
Before further unpacking the intentional stance, one helpful analogy might be that the three stances can be understood as providing gears-level models for the system under consideration, at different levels of abstraction. For purposes of illustration, imagine we want to model the behavior of a housekeeping robot:
- The physical stance gives us a gears-level model where the gears are the literal gears (or other physical components) of the robot.
- The design stance gives us a gears-level model where the gears come from the level of abstraction at which the system was designed. The gears could be e.g. the CPU, memory, etc., on the hardware side, or on the level of the robot’s user interface, on the software side.
- The intentional stance gives us a gears-level model where the relevant gears are the robot’s beliefs, desires, goals, etc. [...]
Now that he’s described how we attribute beliefs and desires to systems that seem to us to have intentions of one kind or another, “the next task would seem to be distinguishing those intentional systems that really have beliefs and desires from those we may find it handy to treat as if they had beliefs and desires.” (For example, although a thermostat’s behavior can be understood under the intentional stance, most people intuitively feel that a thermostat doesn’t “really” have beliefs.) This, however, cautions Dennett, would be a mistake.
As a thought experiment, Dennett asks us to imagine that some superintelligent Martians descend upon us; to them, we’re as simple as thermostats are to us. If they were capable of predicting the activities of human society on a microphysical level, without ever treating any of us as intentional systems, it seems fair to say that we wouldn’t “really” be believers, to them. This shows that intentionality is somewhat observer-relative—whether or not a system has intentions depends on the modeling capabilities of the observer.
However, this is not to say that intentionality is completely subjective, far from it—there are objective patterns in the observables corresponding to what we call “beliefs” and “desires.” (Although Dennett is careful to emphasize that these patterns don’t allow one to perfectly predict behavior; it’s that they predict the data unusually well for how simple they are. For one, your ability to model an intentional system will fail under certain kinds of distributional shifts; analogously, understanding a computer under the design stance does not allow one to make accurate predictions about what it will do when submerged in liquid helium.) [...]
If something appears agent-y to us (i.e., we intuitively use the intentional strategy to describe its behavior), our next question tends to be, “but is it really an agent?” (It’s unclear what exactly is meant by this question in general, but it might be interpreted as asking whether some parts of the system correspond to explicit representations of beliefs and/or desires.) In the context of AI safety, we often talk about whether or not the systems we build “will or won’t be agents,” whether or not we should build agents, etc.
One of Dennett’s key messages with the intentional stance is that this is a fundamentally confused question. What it really and truly means for a system to “be an agent” is that its behavior is reliably predictable by the intentional strategy; all questions of internal cognitive or mechanistic implementation of such behavior are secondary. (Put crudely, if it looks to us like an agent, and we don’t have an equally-good-or-better alternative for understanding that system’s behavior, well, then it is one.) In fact, once you have perfectly understood the internal functional mechanics of a system that externally appears to be an agent (i.e. you can predict its behavior more accurately than with the intentional stance, albeit with much more information), that system stops looking like “an agent,” for all intents and purposes. (At least, modeling the system as such becomes only one potential model for understanding the system’s behavior, which you might still use in certain contexts e.g. for efficient inference or real-time action.)
We should therefore be more careful to recognize that the extent to which AIs will “really be agents” is just the extent to which our best model of their behavior is of them having beliefs, desires, goals, etc. If GPT-N appears really agent-y with the right prompting, and we can’t understand this behavior under the design stance (how it results from predicting the most likely continuation of the prompt, given a giant corpus of internet text) or a “mechanistic” stance (how individual neurons, small circuits, and/or larger functional modules interacted to produce the output), then GPT-N with that prompting really is an agent.
Before I got to the point in my education where I learned what the CPU has eaten, it seemed that programming languages formed a ladder of more abstract and more concrete languages, and that it was just a matter of translating one language into another. The primitive "takes orders" capacity seemed mysterious: how could it ever appear or be explained in the hierarchy? The beauty of learning what a primitive computer is like is that none of the parts "take orders"; it's the software that is done entirely in hardware.
But processors are externally driven. For agents I suspect the core property is autopoiesis, i.e. being run from signals emerging from within. Circuits will do some computation when excited but then "sleep" if the environment is not actively pushing in. Computers can keep up the excitation but will do essentially the same pattern unless disturbed from outside. Agents are the things that keep on changing their pattern even if the environment leaves them alone (or their evolution is because of the echo they make into the environment).
There is a sense in which agency is a fundamental concept. Before we can talk about physics, we need to talk about metaphysics (what is a "theory of physics"? how do we know which theories are true and which are false?). My best guess theory of metaphysics is infra-Bayesian physicalism (IBP), where agency is a central pillar: we need to talk about hypotheses of the agent, and counterfactual policies of the agent. It also looks like epistemic rationality is inseparable from instrumental rationality: it's impossible to do metaphysics without also doing decision theory.
Does this refute reductionist materialism? Well, it depends how you define "reductionist materialism". There is a sense in which IBP is very harmonious with reductionist materialism, because each hypothesis talks about the universe from a "bird's eye view", without referring to the relationship of the agent with the universe (this relationship turns out to be possible to infer using the agent's knowledge of its own source code), or even assuming any agent exists inside the universe described by the hypothesis. But, the agent is still implicit in the "whose hypothesis".
Once we accept the "viewpoint agent" (i.e. the agent who hypothesizes/infers/decides) as fundamental, we can still ask, what about other agents? The answer is: other agents are programs with high value of (see Definition 1.6 in the IBP article) which the universe is "running" (this is a well-defined thing in IBP). In this sense, other agents are sort of like rocks: emergent from the fundamental reductionist description of the universe. However, there's a nuance: this reductionist description of the universe is a belief of the viewpoint agent. The fact it is a belief (formalized as a homogeneous ultradistribution) is crucial in the definition. So, once again, we cannot eliminate agency from the picture.
The silver lining is that, even though the concept of which programs are running is defined using beliefs, i.e. requires a subjective ontology, it seems likely different agents inhabiting the same universe can agree on it (see subsection "are manifest facts objective" in the IBP article), so there is a sense in which it is objective after all. Decide for yourself whether to call this "reductionist materialism".
Since you switched the moderation to "easy-going"...
I have hinted at a definition in an old post https://www.lesswrong.com/posts/NptifNqFw4wT4MuY8/agency-is-bugs-and-uncertainty. Basically we use agency as a black-box description of something.
Of course, as generally agreed, agency is a convenient intentional stance model. There is no agency in a physical gears-level description of a system.
But this is circular. An abstraction for whom? What even is an abstraction, when you're in the process of defining an agent? Is there some agent-free definition of an abstraction implicitly being invoked here?
To build it up from the first principles, we must start with a compressible (not fully random) universe, at a minimum, because "embedded agents", whatever they might turn out to be, are defined by having a somewhat accurate (i.e. lossily compressed) internal model of the world, so some degree of compressibility is required. (Though maybe useful lossy compression of a random stream is a thing, I don't know.)
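As a rough illustration of the "compressible vs. fully random" distinction, here is a minimal sketch. It uses `zlib`, a lossless compressor, as a crude proxy for "how much structure a model could exploit"; that choice, and the two made-up byte streams, are assumptions for illustration only, and it does not address the lossy-compression question raised in the parenthetical.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; near (or above) 1 means zlib found no structure."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Stand-in for a "fully random" world: pseudo-random bytes.
noise = bytes(rng.getrandbits(8) for _ in range(n))
# Hypothetical world with a simple repeating pattern.
regular = bytes((i // 50) % 7 for i in range(n))

print("noise:  ", compression_ratio(noise))
print("regular:", compression_ratio(regular))
```

The patterned stream shrinks to a small fraction of its size, while the noisy one does not shrink at all. Only in the first kind of world could a small internal model stand in for the much larger thing it models.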
Next, one would identify some persistent features of the world that look like they convert free energy into entropy (note that a lot of "natural" systems behave like that, say, stars).
Finally, merging the two: a feature of the world that contains what appears to be a miniature model of the (relevant part of the) world, and which also converts energy into entropy to persist the model and "itself", would be sort of close to an "agent".
There are plenty of holes in this outline, but at least there is no circularity, as far as I can tell.
One possible definition is to look for things which are more optimized than any simple mechanism you can imagine for performing a task. So, e.g., Kasparov is great at playing chess; as an amateur you can verify this by noting that any plan you can come up with will tend to do worse than Kasparov's plans (with high probability). In some sense this is an observer-relative definition, but it can be made more objective by considering the minimally-complex program that can match a given level of performance on a task, parameterized by e.g. Levin complexity. See this comment.
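For reference, the first line below is the standard definition of Levin complexity with respect to a fixed universal machine $U$, where $|p|$ is the length of program $p$ in bits and $t(p)$ is its running time. The second line is only one guess at how the "minimally-complex program matching a given level of performance" idea might be written down; the task $\tau$, the score $\mathrm{perf}_\tau$, and the threshold $\theta$ are hypothetical names introduced here, not notation from the linked comment.

```latex
% Standard Levin complexity of a string x:
Kt(x) = \min_{p} \bigl\{\, |p| + \log_2 t(p) \;:\; U(p) = x \,\bigr\}

% Hypothetical task-relative variant (an assumption, for illustration):
% the cheapest program that reaches performance level \theta on task \tau.
C_\tau(\theta) = \min_{p} \bigl\{\, |p| + \log_2 t(p) \;:\; \mathrm{perf}_\tau(p) \ge \theta \,\bigr\}
```

On this reading, a system would count as strongly optimized for $\tau$ when $C_\tau$ at its own performance level is large, i.e. when no short, fast program does as well.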