Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Selection Theorems: A Program For Understanding Agents

28th Sep 2021

19evhub

13johnswentworth

3evhub

18DragonGod

4Gyrodiot

3DragonGod

16Linda Linsefors

7johnswentworth

7Charlie Steiner

2johnswentworth

2Charlie Steiner

2johnswentworth

2Charlie Steiner

3johnswentworth

2Charlie Steiner

6Thomas Kwa

6Vika

4adamShimi

4Rohin Shah

2johnswentworth

4Rohin Shah

4johnswentworth

4Rohin Shah

4johnswentworth

4Rohin Shah

5johnswentworth

4Gordon Seidoh Worley

1Purged Deviator

New Comment

28 comments, sorted by Click to highlight new comments since: Today at 12:46 AM

Have you seen Mark and my “Agents Over Cartesian World Models”? Though it doesn't have any Selection Theorems in it, and it just focuses on the type signatures of goals, it does go into a lot of detail about possible type signatures for agent's goals and what the implications of those type signatures would be, starting from the idea that a goal can be defined on any part of a Cartesian boundary.

Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.

Thanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.

I am an aspiring selection theorist and I have thoughts.

Learning about selection theorems was very exciting. It's one of those concepts that felt so obviously *right*. A missing component in my alignment ontology that just clicked and made everything stronger.

There are many reasons to be sympathetic to agent foundations style safety research as it most directly engages the hard problems/core confusions of alignment/safety. However, one concern with agent foundations research is that we might build sky high abstraction ladders that grow increasingly disconnected from reality. Abstractions that don't quite describe the AI systems we deal with in practice.

I think that in presenting this post, Wentworth successfully sidestepped the problem. He presented an intuitive story for why the Selection Theorems paradigm would be fruitful; it's general enough to describe many paradigms of AI system development, yet concrete enough to say nontrivial/interesting things about the properties of AI systems (including properties that bear on their safety). Wentworth presents a few examples of extant selection theorems (most notably the coherence theorems) and later argues that selection theorems have a lot of research "surface area" and new researchers could be onboarded (relatively) quickly. He also outlined concrete steps people interested in selection theorems could take to contribute to the program.

Overall, I found this presentation of the case for selection theorems research convincing. I think that selection theorems provide a solid framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development artificial intelligence. That is regardless of what changes in architecture or training methodology subsequent paradigms may bring, I expect selection theoretic results to still apply^{[1]}.

I currently consider selection theorems to be the most promising agent foundations flavoured research paradigm.

My preferred analogy for selection theorems is asymptotic complexity in computer science. Using asymptotic analysis we can make highly non-trivial statements about the performance of particular (or arbitrary!) algorithms that abstract away the underlying architecture, hardware, and other implementation details. As long as the implementation of the algorithm is amenable to our (very general) models of computation, the asymptotic/complexity theoretic guarantee will generally still apply.

For example, we have a very robust proof that no comparison-based sorting algorithm can attain better worst case time complexity than (this happens to be a very tight lower bound as extant algorithms [e.g. mergesort] attain it). The model behind the lower bound of comparison sorting is very minimal and general:

- Data operations
- Comparing two elements
- Moving elements (copying or swapping)

- Cost: number of such operations

Any algorithm that performs sorting by directly comparing elements to determine their order conforms to this model. The lower bound of obtains because we can model the execution of the sorting algorithm by a binary decision tree:

- Nodes: individual comparisons between elements
- Edges: different outputs of comparisons ( and )
- Leaf nodes: unique permutation of the input array that corresponds to that particular root to leaf path of the tree

The number of executions of the sorting algorithm for any given input permutation is given by the number of edges between the root node and that leaf. The worst case running time of the algorithm is given by the height of the tree. Because there are possible permutations of the input array, the lowest attainable worst case complexity is . Which is in .

I reiterate that this is a very powerful result. Here we've set up very minimal assumptions about our model (comparisons are made between pairs of elements to determine order, the algorithm can copy or swap elements) and we've obtained a ridiculously strong impossibility result^{[2]}.

Selection theorems present a minimal model of an intelligent system as an agent situated in an environment. The agents are assumed to be the product of some optimisation process selecting for performance on a given metric (e.g. inexploitability in multi-agent environments, for the coherence theorems).

The exact optimisation process performing the selection is abstracted away [only the performance metric/objective function(s) of optimisation matters], and the hope is to do the same for the environment (that is, selection theoretic results should apply to a broad class of environments (e.g. for the coherence theorems, the only constraint imposed on the environment is that it contains other agents)].

Using the above model, selection theorems try to derive^{[3]} agent "type signatures" (the representation [data structures], interfaces [inputs & outputs] and embedding [in an underlying physical (or other low level) system] of the agent and specific agent aspects (world models, goals, etc.). It's through these type signatures that safety relevant properties of agents can be concretely formulated (and hopefully proven).

For example, the proposed anti-naturalness of corrigibility to expected utility maximisation can be seen as an "impossibility result"^{[4]} of a safety property (corrigibility) derived from a selection theorem (the coherence theorems).

While this is a negative result, I expect no fundamental difficulty to obtaining positive selection theoretic guarantees of safety properties.

**I see the promise of selection theorems as doing for AI safety, what ****complexity theory**** does for algorithm performance.**

I expect that we will be able to provide selection theoretic guarantees of nontrivial safety properties/desiderata.

In particular, I think selection theorems naturally lend themselves to proving properties that are selected for/emerge *in the limit*^{[5]} of optimisation for particular objectives (convergence theorems?). I find the potential of asymptotic guarantees exhilarating.

Properties proven to emerge in the limit become *more robust* with (increasing) scale. I think that's an incredibly powerful result. Furthermore, asymptotic complexity analysis suggests that it's often easier to make statements about what holds in the limit than about what holds at particular intermediate levels. (We can very easily talk about how the performance of two algorithms compare on a particular problem as input size tends towards infinity without considering implementation details or underlying hardware and ignoring all constant factors. To talk about the performance of two algorithms on input sets of a particular fixed size, we'd need to consider all the aforementioned details).

The combination of:

- "Properties that are selected for in the limit become more robust with (increasing) scale" and
- "It is much easier to describe the limit of a process than particular intermediate states"

is immensely powerful^{[6]}. It makes selection theorems a hugely compelling — perhaps the one I find most personally compelling — AI safety research paradigm.

While I am quite enamoured with Wentworth’s selection theorems, I find myself somewhat dissatisfied. As Wentworth framed it, I think they are a bit off.

A major limitation of the coherence theorems is that they constrain agents to an archetype that does not necessarily describe real agents (or other intelligent systems) well. In particular, the coherence theorems assume agent preferences are:

- Static (do not change with time)
- Path independent (exact course of action taken to get somewhere does not affect the agent's preferences, alternatively it assumes that agents do not have internal states that factor into their preferences)
- Complete (for any two options, the agent prefers one of them or is indifferent. It doesn't permit a notion of "incomplete preferences")

The failure of coherence theorems to carve reality at the joints is a valuable lesson re: choosing the right preconditions for our theorems (if our preconditions are too restrictive/strong, they might describe systems that don't matter in the real world ["spherical cows"]). And it's a mistake I worry that the paradigm of "agent type signatures" might be making.

To be precise, I am quite unconvinced that “agent” is the “true name” of the relevant intelligent systems. There are powerful artifacts (e.g. the base versions of large language models) that do not match the agent archetype as traditionally conceived. I do not know that the artifacts that ultimately matter would necessarily conform to the agent archetype^{[7]}. Theorems that are exclusively about the properties of agents may end up not being very applicable to important systems of interest (if e.g. the first AGIs are created by a [mostly] self-supervised training process).

Agent selection theorems are IMO ultimately too restrictive (their preconditions are too strong to describe all intelligent systems of interest/they implicitly preclude from analysis some intelligent systems we'll be interested in), and the selection theorem agenda should be generalised to optimisation processes and the kind of constructs they select for.

That is, regardless of paradigm, intelligent systems (e.g. humans, trained ML models and expert systems) are the products of optimisation processes (e.g. natural selection, stochastic gradient descent, and human design^{[8]} respectively).

So, a theory based solely on optimisation processes seems general enough to describe all intelligent systems of interest (while being targeted enough to say nontrivial/interesting things about such systems) and *minimal* (we can't relax the preconditions anymore while still obtaining nontrivial results about intelligent systems).

The agent type signature paradigm is insufficiently general.

In the remainder of this post, I would like to slightly adjust the concept of selection theorems to better reflect what I think they should be^{[9]}.

There are two broad classes of theorems that seem valuable:

For a given (collection of) objective(s), and underlying configuration space what type^{[10]} of artifacts are produced by constructive optimisation processes (e.g. natural selection, stochastic gradient descent and human design) that select for performance on said objective(s)?

Fundamentally, they ask the question:

What properties are selected for by optimisation for a particular (collection of) objective function(s)?

The aforementioned "convergence theorems" would be a particular kind of constructor theorems.

Artifact theorems are the dual of constructor theorems. If constructor theorems seek to identify the artifact type produced by a particular constructive optimisation process, then artifact theorems seek to identify the constructive optimisation process that produced particular artifacts (the human brain, trained ML models and the quicksort algorithm respectively).

That is:

For a given artifact type and associated configuration spaces, what were the objectives

^{[11]}of the optimisation process that produced it?

- I.e. describe the class of problems/domains/tasks the objectives belong to
- Can we also specify a type for the objectives?

- What properties do its members have?
- Which properties are necessary to select for that artifact type?
- What is its parent type?

- Which properties are sufficient?
- What are the interesting child types?

I suspect that e.g. investigating general intelligence artifact theorems would be a promising research agenda for robust safety of arbitrarily capable general systems.

^{^}Provided we use sufficiently general agent/system models as the foundation for our selection theoretic results.

^{^}I should point out that this impossibility result is somewhat atypical; for many interesting problems we don't regularly obtain (non-trivial [e.g. the size of the input or output]) tight lower bounds on complexity.

^{^}Usually, some parts of the type signatures are assumed (implicitly or explicitly) by the theorem.

^{^}Jessica Taylor told me that she thinks the anti-naturalness of corrigibility is more of a "research intuition" than a theorem.

^{^}I'm under the impression that it was when thinking about what emerges in the limit that I first drew the relationship between selection theorems and complexity theory. However, this may be a false memory (or otherwise not a particularly reliable recollection of events).

^{^}It feels almost too good to be true, like we're cheating in the mileage we get out of selection theorems.

^{^}While any physical system can be constituted as an agent situated in an environment, the agent archetype is not illuminating for all of them. Viewing a calculator as an agent does not really offer any missing insight into the operations of the calculator. It does not allow you to better predict its behaviour.

^{^}Insomuch as one accepts that design is a kind of optimisation process. And I would insist that you should, but I've not gotten around to writing up my thoughts on what exactly qualifies as an optimisation process in a form that I would endorse. Eliezer's "Measuring Optimisation Power" is a fine enough first approximation

^{^}The quickest gloss is that:

- “Agent” should be replaced with “artifact” (a general term for any object that is the product of an optimisation process).

Some sample artifacts and the optimisation process that produced them:

* The human brain: natural selection

* Trained ML models: stochastic gradient descent

* : Newton's method (approximation for the square root of )

* The quicksort algorithm: human design

^{^}Among other things, a type should specify a set of properties that all members of the type share. If those properties are necessary and sufficient for an artifact to belong to a particular type, the type could simply be identified with its collection of properties.

Types can exist at different levels of abstraction (allowing them to specify artifact properties at different levels of detail).

An artifact can belong to multiple types (e.g. I might belong to the types: "human", "male", "Nigerian").

^{^}Rather than identifying the optimisation process in detail, only the objective function of the optimisation process is considered. Any other particulars/specifics of the optimisation process are abstracted away (the same way implementation details are abstracted away in asymptotic analysis).

The motivation is that I think that any two optimisation processes with the same objective functions on the same configuration space with the same "optimisation power" are identical for our purposes. And for convergence theorems, even the optimisation power is abstracted away.

Not sure how usefull this is, but I think this counts as a selection theorem.

(Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)

https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html

We played around with taking learning algorithms designed for multi armed bandit problems (your action matters but not your policy) and placing them in Newcomblike environments (both your acctual action and your probability distribution over actions matters). And then we proved some stuf about their behaviour.

Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What's stopping me?

I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking model classes for agents?

I like Savage, so I'll pick on him. Before you even get into what he considers the "real" desiderata, he wants to say that there's a set of actions which are functions from states to consequences, and this set is closed under the operation of using one action for some arbitrary states and another action for the rest. But humans very don't work that way - I'd want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking.

Or if we're thinking about modeling humans to extract the "preferences" part of the model: Suppose Person A wants to get out a function that ranks actions, while Person B wants to learn a utility function, its domain of validity, and a custom world-model that the utility function lives in. What's the model for how something like a selection theorem will help them resolve their differences?

You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.

In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems suggest, rather than a utility function, but if for some reason they really wanted a utility function they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works - i.e. just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.

We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?

Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to *interface* with our models in any other way than querying for probabilities over the low-level state of the system.

Right. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.

The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.

An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal in any important way. If it is importantly multimodal, then *any* point estimate will be very misleading/confusing/antihelpful.

More generally, the take away here is "we don't get to arbitrarily choose the type signature"; that choice is dependent on properties of the system.

This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way that bypasses the simplifications we're trying to force on it.

For this problem, which might not be what you're talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non human-legible) way. The solution is not better shackles, it's an inference method that trades off between model complexity and error in a different way.

(P.S.: I think there is an "obvious" way to do that, and it's MML learning with some time constant used to turn error rates into total discounted error, which can be summed with model complexity.)

I wish I could review this for the 2022 review, but it's from 2021.

I think this post is pretty valuable. The big takeaway for me was that any argument involving coherence should cache out in selection.

One caveat is that I haven't seen much research into selection theorems in the intervening couple of years, and adjacent things like inductive bias research don't seem to have good applications yet. Maybe it's too hard for where the field is right now.

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.

Just posted an analysis of the epistemic strategies underlying selection theorems and their applications. Might be interesting for people who want to go further with selection theorem, either by proving one or by critiquing one.

Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.

As an example, [coherence arguments](

https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any non-dominated agent can be represented as maximizing expected utility. (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.) If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](

http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](

https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selections theorems to prove.

Planned opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough, you need to combine them with some other assumption (for example, that there is a money-like resource over which the agent has no terminal preferences). Similarly, I don’t expect this research agenda to find a selection theorem that says that an existential catastrophe occurs _assuming only that the agent is intelligent_, but I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, because we think the assumptions involved in the theorems are quite likely to hold. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I would not actively discourage anyone from doing this sort of research, and I think it would be more useful than other types of research I have seen proposed.

A few comments...

Selection theorems are helpful because they tell us likely properties of the agents we build.

What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

- Properties of humans as agents (e.g. "human values")
- Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
- Properties of agents which we accidentally aim for (e.g. inner agency issues)

Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.

(What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.)

"Non-dominated" is always (to my knowledge) synonymous with "Pareto optimal", same as the usage in game theory. It varies only to the extent that "pareto optimality of what?" varies; in the case of coherence theorems, it's Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)

If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.

... I mean, that's a valid argument, though kinda misses the (IMO) more interesting use-cases, like e.g. "if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility". Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.

(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)

Thanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selections theorems to prove.

New opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.

I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior

I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).

Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Agents selected by ML (e.g. RL training on games) also often have internal state.

Edited to

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values

and

[...] the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)

At the same time,

better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.

The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.

The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need *different* assumptions, because some of the currently standard assumptions are wrong - like utility functions.)

That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don't even understand what kind-of-thing we're trying to align AI *with*.

(Side-note: this is also the main value which I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the "input variables" are for human values.)

Taking a different angle: if we're concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don't even understand what kind-of-thing we're supposed to look for.

Interesting. Selection theorems seem like a way of identifying the purposes or source of goal directness in agents that seems obvious to us yet hard to pin down. Compare also the ground of optimization.

Started reading, want to get initial thoughts down before they escape me. Will return when I am done

Representation: An agent, or an agent's behaviour, is a script. I don't know if that's helpful or meaningful. Interfaces: Real, in-universe agents have hardware on which they operate. I'd say they have "sensors and actuators", but that's tautologous to "inputs and outputs". Embedding: In biological systems, the script is encoded directly in the structure of the wetware. The hardware / software dichotomy has more separation, but I think I'm probably misunderstanding this.

What’s the type signature of an agent?

For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they

don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs -Coherence Theoremsimply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, andhuman goals seem to be over latent variables in humans’ world models.And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they

embedin their low-level world? These are all questions about the type signatures of agents.One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking,

a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.We already have many Selection Theorems:

Coherence and Dutch Book theorems,Good RegulatorandGooder Regulator, theKelly Criterion, etc. These theorems generally seem to point in a similar direction - suggesting deep unifying principles exist - but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.The quest for better Selection Theorems has a lot of “surface area”- lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requiresrelativelylittle ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time,better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.Outline of this post:

## What’s A Type Signature Of An Agent?

We’ll view the “type signature of an agent” as an answer to three main questions:

A selection theorem typically assumes some parts of the type signature (often implicitly), and derives others.

For example,

coherence theoremsshow that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.Coherence theorems fall short of what we ultimately want in a lot of ways: neither the assumptions nor the type signature are quite the right form for real-world agents. (More on that later.) But they’re a good illustration of what a selection theorem is, and how it tells us about the type signature of agents.

Here are some examples of “type signature” questions for specific aspects of agents:

within the world model?Does the agent perform search/optimization within the world model, or in the world directly?Does agent-like behavior imply agent-like internal architecture?## What’s A Selection Theorem?

A Selection Theorem tells us something about what agent type signatures will be selected for in some broad class of environments. Two important points:

everyquestion about agent type signatures; it just needs to tell ussomethingabout agent type signatures.For instance, the

subagents argumentsays that, when our “agents” have internal state in a coherence-theorem-like setup, the “goals” will be pareto optimality over multiple utilities, rather than optimality of a single utility function. This says very little about embeddedness or world models or internal architecture; it addresses only one narrow aspect of agent type signatures. And, like the coherence theorems, it doesn’tdirectlytalk about selection; it just says that any strategy which doesn’t fit the pareto-optimal form is strictly dominated by some other strategy (and therefore we’d expect that other strategy to be selected, all else equal).Most Selection Theorems, in the short-to-medium term, will probably be like that: they’ll each address just one particular aspect of agent type signatures. That’s fine. As long as the assumptions are general enough and realistic enough, we can use lots of theorems together to narrow down the space of possible types.

Eventually, I do expect that most of the core ideas of Selection Theorems will be unified into a small number of Fundamental Theorems of Agency - perhaps even a single theorem. But that’s not a necessary assumption for the usefulness of this program, and regardless, I expect a lot of progress on theorems addressing specific aspects of agent type signatures before then.

## How to work on Selection Theorems

## New Theorems

The most open-ended way to work on the Selection Theorems program is, of course, to come up with new Selection Theorems.

If you’re relatively-new to this sort of work and wondering how one comes up with useful new theorems, here are some possible starting points:

frame, and try to apply it. For example, I’ve been getting surprisingly a lot of mileage out of thecomparative advantage framelately; it turns out to give some neat variants of the Coherence Theorems.Also, take a look at

What’s So Bad About Ad-Hoc Mathematical Definitions?to help build someuseful aestheticintuitions.## Incremental Work

This is work which starts from one or more existing selection theorem(s), and improves on them somehow.

Some starting points with examples where I’ve personally found them useful before:

subagentsidea started from trying (and failing) to apply Coherence Theorems to financial markets.alogical inductor implemented as a market; Iextended thatto show thatanylogical inductor is behaviorally equivalent to a market.Gooder Regulatorwas basically that.A couple other approaches for which I don’t have a great example from my own work, but which I expect to be similarly fruitful:

## Up Next

I currently have two follow-up posts planned:

These are explicitly intended to help people come up with ways to contribute to the Selection Theorems program.