What’s the type signature of an agent?
For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.
And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all questions about the type signatures of agents.
One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.
We already have many Selection Theorems: Coherence and Dutch Book theorems, Good Regulator and Gooder Regulator, the Kelly Criterion, etc. These theorems generally seem to point in a similar direction - suggesting deep unifying principles exist - but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.
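To give a concrete flavor of one theorem on that list: the Kelly criterion says that, among strategies betting a fixed fraction of wealth, long-run wealth growth is maximized by betting the fraction f* = p - (1 - p)/b, where p is the win probability and b the net odds. A minimal sketch:

```python
import math

def kelly_fraction(p, b):
    """Growth-optimal fixed fraction of wealth to bet: f* = p - (1 - p)/b."""
    return p - (1 - p) / b

def growth_rate(f, p, b):
    """Expected log-wealth growth per bet when betting fraction f."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)
```

For a 60%-probability even-odds bet, `kelly_fraction(0.6, 1.0)` gives 0.2, and `growth_rate` is indeed higher at 0.2 than at nearby fractions. The "selection" flavor: under repeated betting, fixed-fraction strategies other than f* are outgrown almost surely.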
The quest for better Selection Theorems has a lot of “surface area” - lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requires relatively little ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.
Outline of this post:
- More detail on what “type signatures” and “Selection Theorems” are
- Examples of existing Selection Theorems and what they prove (or assume) about agent type signatures
- Aspects which I expect/want from future Selection Theorems
- How to work on Selection Theorems
What’s A Type Signature Of An Agent?
We’ll view the “type signature of an agent” as an answer to three main questions:
- Representation: What “data structure” represents the agent - i.e. what are its high-level components, and how can they be represented?
- Interfaces: What are the “inputs” and “outputs” between the components - i.e. how do they interface with each other and with the environment?
- Embedding: How does the abstract “data structure” representation relate to the low-level system in which the agent is implemented?
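As a purely illustrative sketch (not any canonical formalization), here is how those three questions might cash out as an actual data structure. All names and types below are made up for the example:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative placeholder types; a real Selection Theorem would pin these down.
WorldState = str
Action = str

@dataclass
class AgentSketch:
    """One hypothetical answer to the three questions, for illustration only."""
    # Representation: the agent's high-level components.
    world_model: Dict[WorldState, float]                  # e.g. a probability distribution
    goal: Callable[[WorldState], float]                   # e.g. a utility function
    outcome: Callable[[Action, WorldState], WorldState]   # environment's response to actions

    # Interfaces: outputs are actions, chosen here by expected-goal maximization.
    def act(self, actions: List[Action]) -> Action:
        def expected_value(a: Action) -> float:
            return sum(p * self.goal(self.outcome(a, s))
                       for s, p in self.world_model.items())
        return max(actions, key=expected_value)

# Embedding is the part no data structure alone captures: the claim that some
# low-level physical system actually implements this structure.
```

The point is not that this particular sketch is right (it almost certainly isn't), but that a type signature is the kind of thing which could in principle be written down this concretely.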
A selection theorem typically assumes some parts of the type signature (often implicitly), and derives others.
For example, coherence theorems show that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.
- Representation: utility function and probability distribution.
- Interfaces: both the utility function and distribution take in “bet outcomes”, assumed to be specified as part of the environment. The outputs of the agent are “actions” which maximize expected utility under the distribution; the inputs are “observations” which update the distribution via Bayes’ Rule.
- Embedding: “agent” must interact with “environment” only via the specified “bets”. Utility function and distribution relate to low-level agent implementation via behavioral equivalence.
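The components above can be sketched as minimal code, with made-up bet outcomes and utilities: a distribution and utility over outcomes, observations folded in via Bayes' rule, and actions chosen by expected-utility maximization.

```python
def bayes_update(prior, likelihood, observation):
    """Posterior over outcomes, given likelihood(observation, outcome)."""
    unnorm = {o: p * likelihood(observation, o) for o, p in prior.items()}
    z = sum(unnorm.values())
    return {o: p / z for o, p in unnorm.items()}

def choose_bet(bets, utility, dist):
    """Pick the bet which maximizes expected utility under dist."""
    def expected_utility(bet):
        return sum(p * utility(bet, outcome) for outcome, p in dist.items())
    return max(bets, key=expected_utility)
```

Note what the coherence theorems actually claim: not that a non-dominated agent literally runs code like this, but that its behavior is equivalent to some choice of `utility` and prior.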
Coherence theorems fall short of what we ultimately want in a lot of ways: neither the assumptions nor the type signature are quite the right form for real-world agents. (More on that later.) But they’re a good illustration of what a selection theorem is, and how it tells us about the type signature of agents.
Here are some examples of “type signature” questions for specific aspects of agents:
- World models
  - Does the agent have a world model or models?
  - What data structure can represent an agent’s world model? (Probability distributions are the most common choice.)
  - How does the agent’s world model correspond to the world? (For instance, which physical things do the random variables in a probability distribution correspond to, if any?)
  - What’s the relationship between the abstract “world model” and the physical stuff from which the world model is built?
- Goals
  - Does the agent have a goal or goals?
  - What data structure can represent an agent’s goal? (Utility functions are the most common choice.)
  - How does the goal correspond to the world - especially if it’s evaluated within the world model?
- Agents in general
  - Does the agent have well-defined goals or world models or other components?
  - Does the agent perform search/optimization within the world model, or in the world directly?
  - What are the agent’s “inputs” and “outputs” - e.g. actions and observations?
  - Does agent-like behavior imply agent-like internal architecture?
  - What’s the relationship between the abstract “agent” and the physical stuff from which the agent is built?
What’s A Selection Theorem?
A Selection Theorem tells us something about what agent type signatures will be selected for in some broad class of environments. Two important points:
- The theorem need not directly talk about selection - e.g. it could state some general property of optima, of “broad” optima, of “most” optima, or of optima under a particular kind of selection pressure (like natural selection or financial profitability).
- Any given theorem need not address every question about agent type signatures; it just needs to tell us something about agent type signatures.
For instance, the subagents argument says that, when our “agents” have internal state in a coherence-theorem-like setup, the “goals” will be Pareto optimality over multiple utilities, rather than optimality of a single utility function. This says very little about embeddedness or world models or internal architecture; it addresses only one narrow aspect of agent type signatures. And, like the coherence theorems, it doesn’t directly talk about selection; it just says that any strategy which doesn’t fit the Pareto-optimal form is strictly dominated by some other strategy (and therefore we’d expect that other strategy to be selected, all else equal).
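The Pareto-optimality condition in the subagents argument is easy to state concretely. A toy sketch, with made-up payoff vectors (one coordinate per internal utility):

```python
def pareto_dominates(a, b):
    """True if payoff vector a is at least as good on every utility
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal(strategies):
    """The strategies whose payoff vectors no other strategy Pareto-dominates."""
    return {s: v for s, v in strategies.items()
            if not any(pareto_dominates(w, v)
                       for t, w in strategies.items() if t != s)}
```

With payoffs `{"A": (3, 1), "B": (1, 3), "C": (2, 2), "D": (1, 1)}`, only D is dominated (by C): no single utility function picks out the set {A, B, C}, but Pareto optimality over the two utilities does.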
Most Selection Theorems, in the short-to-medium term, will probably be like that: they’ll each address just one particular aspect of agent type signatures. That’s fine. As long as the assumptions are general enough and realistic enough, we can use lots of theorems together to narrow down the space of possible types.
Eventually, I do expect that most of the core ideas of Selection Theorems will be unified into a small number of Fundamental Theorems of Agency - perhaps even a single theorem. But that’s not a necessary assumption for the usefulness of this program, and regardless, I expect a lot of progress on theorems addressing specific aspects of agent type signatures before then.
How To Work On Selection Theorems
The most open-ended way to work on the Selection Theorems program is, of course, to come up with new Selection Theorems.
If you’re relatively-new to this sort of work and wondering how one comes up with useful new theorems, here are some possible starting points:
- Study examples of evolved agents to see what kind of type signatures they develop under what conditions. I recommend coming at this from as many different angles as possible - e.g. ML, economics, and biology - to build intuitions.
- Once you have some intuition or empirical fact from some specific examples or a particular field, try to expand it to more general agents and selection processes.
- Bio example: sessile (i.e. immobile) organisms don’t usually cephalize (i.e. develop brains). Can we turn this into a general theorem about agents?
- Pick a frame, and try to apply it. For example, I’ve been getting a surprising amount of mileage out of the comparative advantage frame lately; it turns out to give some neat variants of the Coherence Theorems.
- Start from agent type signatures - what type signature makes sense intuitively, based on how humans work? What selection processes would give rise to that type signature, and can you prove it?
- Start from selection processes. What type signatures seem intuitively likely to be selected? Can you prove it?
Also, take a look at What’s So Bad About Ad-Hoc Mathematical Definitions? to help build some useful aesthetic intuitions.
Another way to contribute is incremental work: start from one or more existing Selection Theorems, and improve on them somehow.
Some starting points with examples where I’ve personally found them useful before:
- Take an existing selection theorem, try to apply it to some real-world agency system or some system under selection pressure, see what goes wrong, and fix it. For instance, the subagents idea started from trying (and failing) to apply Coherence Theorems to financial markets.
- Take some existing theorem and strengthen it. For instance, the original logical inductors piece showed the existence of a logical inductor implemented as a market; I extended that to show that any logical inductor is behaviorally equivalent to a market.
- Take some existing theorem with a giant gaping hole and fix the hole. Gooder Regulator was basically that.
A couple other approaches for which I don’t have a great example from my own work, but which I expect to be similarly fruitful:
- Take two existing selection theorems and unify them.
- Take a selection theorem mainly designed for a particular setting (e.g. financial/betting markets) and back out the exact requirements needed to apply it in more general settings.
- Empirical verification, i.e. check that an existing theorem works as expected on some real system. This is most useful when it fails, but success still helps us be sure our theorems aren’t missing anything, and the process of empirical testing forces us to better understand the theorems and their assumptions.
I currently have two follow-up posts planned:
- One post with some existing Selection Theorems, which is already written and should go up later this week. [Edit: post is up.]
- One post on agent type signatures for which I expect/want Selection Theorems - in other words, conjectures. This one is not yet written, and I expect it will go up early next week.
These are explicitly intended to help people come up with ways to contribute to the Selection Theorems program.