Most models of agency (in game theory, decision theory, etc.) implicitly assume that the agent is separate from the environment - there is a “Cartesian boundary” between agent and environment. The embedded agency sequence goes through a long list of theoretical/conceptual problems which arise when an agent is instead embedded in its environment. Some examples:
- No defined input/output channels over which to optimize
- Agent might accidentally self-modify, e.g. drop a rock on its head
- Agent might intentionally self-modify, e.g. change its own source code
- Hard to define hypotheticals which don’t actually happen, e.g. “I will kill the hostages if you don’t pay the ransom”
- Agent may contain subcomponents which optimize for different things
- Agent is made of parts (e.g. atoms) whose behavior can be predicted without thinking of the agent as agenty - e.g. without thinking of the agent as making choices or having beliefs
- Agent is not logically omniscient: it cannot know all the implications of its own beliefs
The embedded agency sequence mostly discusses how these issues create problems for designing reliable AI. Less discussed is how these same issues show up when modelling humans - and, in particular, when trying to define human values (i.e. “what humans want”). Many - arguably most - of the problems alignment researchers run into when trying to create robust pointers to human values are the same problems we encounter when talking about embedded agents in general.
I’ll run through a bunch of examples below, and tie each to a corresponding problem-class in embedded agency. While reading, bear in mind that directly answering the questions posed is not the point. The point is that each of these problems is a symptom of the underlying issue: humans are embedded agents. Patching over each problem one-by-one will produce a spaghetti tower; ideally we’d tackle the problem closer to the root.
The Keyboard is Not The Human
Let’s imagine that we have an AI which communicates with its human operator via screen and keyboard. It tries to figure out what the human wants based on what’s typed at the keyboard.
A few possible failure modes in this setup:
- The AI wireheads by seizing control of the keyboard (either intentionally or accidentally)
- A cat walks across the keyboard every now and then, and the AI doesn’t realize that this input isn’t from the human
- After a code patch, the AI filters out cat-input, but also filters out some confusing (but important) input from the human
Embedded agency problem: humans do not have well-defined output channels. We cannot just point to a keyboard and say “any information from that keyboard is direct output from the human”. Of course we can come up with marginally better solutions than a keyboard - e.g. voice recognition - but eventually we’ll run into similar issues. There is nothing in the world we can point to and say “that’s the human’s output channel, the entire output channel, and nothing but the output channel”. Nor does any such output channel exist, so e.g. we won’t solve the problem just by having uncertainty over where exactly the output channel is.
Because humans are embedded in the physical world, there is no fundamental block to an AI modifying us (either intentionally or unintentionally). Define what a “human” is based on some neural network which recognizes humans in images, and we risk an AI modifying the human by externally-invisible means ranging from drugs to wholesale replacement.
Embedded agency problem: no Cartesian boundary. All the human-parts can be manipulated/modified; the AI is not in a different physical universe from us.
Human choices can depend on off-equilibrium behavior - what we or someone else would do, in a scenario which never actually happens. Game theory is full of examples, especially threats: we don’t launch our nukes because we expect our enemies would launch their nukes… yet what we actually expect to happen is for nobody to launch any nukes. Our own behavior is determined by “possibilities” which we don’t actually expect to happen, and which may not even be possible. Embedded agency problem: counterfactuals.
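To make the deterrence example concrete, here is a minimal sketch of a two-step game. The payoff numbers, the move names, and the functions `attacker_best_move` and `play` are all made up for illustration; nothing here is a standard model. The only thing it shows is that the outcome is determined by what the defender would do at a decision point which is never actually reached.

```python
# Hypothetical payoffs for the would-be attacker. The numbers are illustrative only.
ATTACKER_PAYOFF = {
    ("hold", None): 0,               # nobody launches
    ("launch", "retaliate"): -100,   # both sides launch
    ("launch", "stand_down"): 10,    # attacker launches unopposed
}


def attacker_best_move(defender_policy):
    """Pick the attacker's best move, given what the defender WOULD do if attacked."""
    options = {
        "hold": ATTACKER_PAYOFF[("hold", None)],
        "launch": ATTACKER_PAYOFF[("launch", defender_policy)],
    }
    return max(options, key=options.get)


def play(defender_policy):
    """Return (attacker move, defender response actually carried out)."""
    move = attacker_best_move(defender_policy)
    # The defender's policy is only executed if the attacker launches.
    response = defender_policy if move == "launch" else None
    return move, response


print(play("retaliate"))   # attacker holds; the retaliation never happens
print(play("stand_down"))  # attacker launches; the defender stands down
```

In the `"retaliate"` case, the observed history contains no launch and no retaliation at all, yet swapping the never-executed policy changes what happens. An observer who only sees the actual history has no direct evidence of the policy that produced it.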
Going even further: our values themselves can depend on counterfactuals. My enjoyment of a meal sometimes depends on what the alternatives were, even when the meal is my top pick - I’m happier if I didn’t pass up something nearly-as-good. We’re often unhappy to be forced into a choice, even if it’s a choice we would have made anyway. What does it mean to “have a choice”, in the sense that matters for human values? How do we physically ground that concept? If we want a friendly AI to allow us choices, rather than force us to do what’s best for us, then we need answers to questions like these.
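One way to see why this is awkward to formalize: a standard utility function takes only the outcome as input, but the meal example needs the set of alternatives as an input too. The sketch below is purely illustrative. The `enjoyment` function, the regret weight of 0.5, and the meal scores are all assumptions I made up, not a claim about how human preferences actually work.

```python
def enjoyment(chosen, menu, base, regret_weight=0.5):
    """Hypothetical menu-dependent utility.

    Enjoyment is the base value of the chosen meal, minus a regret term
    proportional to the best alternative that was passed up.
    """
    forgone = [base[meal] for meal in menu if meal != chosen]
    regret = regret_weight * max(forgone, default=0)
    return base[chosen] - regret


# Made-up base scores for three meals.
base = {"ramen": 10, "tacos": 9, "salad": 3}

# Same chosen meal, different alternatives on offer.
print(enjoyment("ramen", ["ramen", "salad"], base))  # 8.5
print(enjoyment("ramen", ["ramen", "tacos"], base))  # 5.5
```

The chosen meal is identical in both calls, and it is the top pick in both, but the value differs because of options which were never taken. Any physical grounding of "what the human wanted" would need to say what counts as being on the menu.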
Humans have different preferences while drunk than while sober [CITATION NEEDED]. When pointing an AI at “human values”, it’s tempting to simply say “don’t count decisions made while drunk”. But on the other hand, people often drink to intentionally lower their own inhibitions - suggesting that, at a meta-level, they want to self-modify into making low-inhibition decisions (at least temporarily, and within some context, e.g. at a party).
Embedded agency problem: self-modification and robust delegation. When a human intentionally self-modifies, to what extent should their previous values be honored, to what extent their new values, and to what extent their future values?
Humans generally have different values in childhood, middle age, and old age. Heck, humans have different values just from being hangry! Suppose a human makes a precommitment, and then later on, their values drift - the precommitment becomes a nontrivial constraint, pushing them to do something they no longer wish to do. How should a friendly AI handle that precommitment?
Embedded agency problem: tiling & delegation failures. As humans propagate through time, our values are not stable, even in the absence of intentional self-modification. Unlike in the AI case, we can’t just design humans to have more stable values. (Or can we? Would that even be desirable?)
Humans have subsystems. Those subsystems do not always want the same things. Stated preferences and revealed preferences do not generally match. Akrasia exists; many people indulge in clicker games no matter how much some other part of themselves wishes they could be more productive.
Embedded agency problem: subsystem alignment. Human subsystems are not all aligned all the time. Unlike the AI case, we can’t just design humans to have better-aligned subsystems - first we’d need to decide what to align them to, and it’s not obvious that any one particular subsystem contains the human’s “true” values.
Preferences Over Quantum Fields
Humans generally don’t have preferences over quantum fields directly. The things we value are abstract, high-level objects and notions. Embedded agency problem: multi-level world models. How do we take the abstract objects/notions over which human values operate, and tie them back to physical observables?
At the same time, our values ultimately need to be grounded in quantum fields, because that’s what the world is made of. Human values should not seemingly cease to exist just because the world is quantum and we thought it was classical. It all adds up to normality. Embedded agency problem: ontological crises. How do we ensure that a friendly AI can still point to human values even if its model of the world fundamentally shifts?
I have, on at least one occasion, completely switched a political position in about half an hour after hearing an argument I had not previously considered. More generally, we humans tend to update our beliefs, our strategies, and what-we-believe-to-be-our-values as new implications are realized.
Embedded agency problem: logical non-omniscience. We do not understand the full implications of what we know, and sometimes we base our decisions/strategies/what-we-believe-to-be-our-values on flawed logic. How is a friendly AI to recognize and handle such cases?
Socially Strategic Self-Modification
Because humans are all embedded in one physical world, lying is hard. There are side-channels which leak information, and humans have long since evolved to pay attention to those side-channels. One side effect: the easiest way to “deceive” others is to deceive oneself, via self-modification. Embedded agency problem: coordination with visible source code, plus self-modification.
We earnestly adopt both the beliefs and values of those around us. Are those our “true” values? How should a friendly AI treat values adopted due to social pressure? More generally, how should a friendly AI handle human self-modifications driven by social pressure?
Combining this with earlier examples: perhaps we spend an evening drunk because it gives us a socially-viable excuse to do whatever we wanted to do anyway. Then the next day, we bow to social pressure and earnestly regret our actions of the previous night - or at least some of our subsystems do. Other subsystems still had fun while drunk, and we do the same thing the next weekend. What is a friendly AI to make of this? Where, in this mess, are the humans’ “values”?
These are the sorts of shenanigans one needs to deal with when dealing with embedded agents, and I expect that a better understanding of embedded agents in general will lead to substantial insights about the nature of human values.