Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic Strategies of Selection Theorems

johnswentworth

Two things I'd especially like to highlight in this post:

Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.

This is possibly the best one-sentence summary I've seen of how these sorts of theorems would be useful.

One corollary of recovering (some of) the usual science-and-engineering strategies is that selection theorems would open the door to a lot of empirical work on alignment and agency. Thus the importance of this section:

- Proving structural selection theorem
  - Choose a selection mechanism to investigate.
  - Find a structural constraint that should be favored by the mechanism.
  - Prove the theorem.
    - Show that agents with these structural constraints are easier to find.
      - Show that many agents without the structural constraints can be easily found by the selection pressure.
    - Show that agents with structural constraints are a majority.
      - Show that there isn’t a majority of selected-for agents with structural constraints.
    - Show that agents with structural constraints are easier to sample.
      - Argue that the set of selected-for agents is different than the one used in the work, and that for the actual set, sampling agents without structural constraints becomes simpler.
    - Propose a sampling of agents and show it results in structural constraints with high probability.
      - Show that the proposed sampling disagrees with what the selection pressure actually finds (showing that the probabilities are different, or that one can sample agents that the other can’t).
- Checking that the selection theorem applies
  - Check that the selection exists.
    - For a mechanism, check that it fits with how selection happens.
      - Show that the actual selection works differently than the mechanism described, and that these differences influence massively what is selected in the end.

These are all potential ways to *empirically test* various kinds of selection theorems.

## Introduction: Epistemic Strategies Redux

This post examines the epistemic strategies of John Wentworth’s selection theorem posts.

(If you want to skim this post, just read the Summary subsections, which display the different epistemic strategies as design patterns.)

I introduced the concept of epistemic strategies in a recent post, but didn’t define them except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithm or, even more abstractly, the design patterns.

An example of epistemic strategy, common in natural sciences (and beyond), is

More than just laying out some abstract recipe, analysis serves to understand how each step is done, whether that makes sense, and how each step (and the whole strategy) might fail. Just like a design pattern or an algorithm, it matters tremendously to know when to apply it and when to avoid it as well as subtleties to be aware of.

Laying this underlying structure bare matters in three ways:

Game Programming Patterns who does exactly that for game programming

Now, before starting, I need to point out that the selection theorems posts don’t present research results; they present epistemic strategies (the eponymous theorems). Does that mean my job has already been done? Not exactly: John’s posts do present that epistemic strategy, but not in all the ways I want to stress. John is also trying to fill in a lot of concrete details and to convince people that selection theorems are a nice thing to research, which I don’t have to do. Instead, you can see this post as distilling the structure of selection theorems and interrogating them further as ways of producing knowledge.

(I use the word “agent” to stay coherent with John, but nothing in the epistemic strategy itself requires agency, and so finding the idea of agents confusing shouldn’t be an issue for reading this post.)

Thanks to John Wentworth for feedback on a draft of this post.

## Characterizing Selection Theorems

Selection theorems are theorems. Obviously. But what sort of theorems? What are they trying to find about the world?

John summarizes the whole class of results in the following way:

This gives us three components of a selection theorem: the selection pressure, the class of environments considered and the constraint on the agent (what John calls the “type signatures”). Let’s get into each, looking for what can fill the corresponding hole in the general selection theorem.

## Selection

A selection theorem is first and foremost about selection. Not just selection mechanisms (low-level processes like natural selection) but also selection criteria (abstract conditions like no Dutch-booking). The former state how selection happens, whereas the latter just characterize the sort of things that will be selected.

One of the differences is that a selection mechanism implies a selection criterion, either implicitly (natural selection) or explicitly (ML training with an actual loss function), whereas a selection criterion doesn’t necessarily come with a mechanism.
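To make the mechanism/criterion distinction concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than anything from John's posts: the "agent" is a single number, `loss` is a made-up objective, `criterion` is a predicate on agents, and `mechanism` is a gradient-descent process that actually produces an agent.

```python
import random

# Hypothetical toy setup: an "agent" is one real-valued parameter.
TARGET = 3.0  # assumed optimum of the made-up loss below


def loss(agent: float) -> float:
    """Toy loss, minimized at TARGET (illustrative assumption)."""
    return (agent - TARGET) ** 2


def criterion(agent: float, tolerance: float = 0.01) -> bool:
    """A selection *criterion*: says which agents count as selected,
    without saying how they would ever be produced."""
    return loss(agent) <= tolerance


def mechanism(steps: int = 200, lr: float = 0.1, seed: int = 0) -> float:
    """A selection *mechanism*: a process (plain gradient descent here)
    that actually produces an agent, starting from a random one."""
    rng = random.Random(seed)
    agent = rng.uniform(-10.0, 10.0)
    for _ in range(steps):
        grad = 2.0 * (agent - TARGET)  # derivative of the toy loss
        agent -= lr * grad
    return agent


selected = mechanism()
# The mechanism comes with a criterion (low loss) built in...
print(criterion(selected))
# ...whereas the criterion alone can only judge a given agent.
print(criterion(0.0))
```

In this sketch the mechanism implies the criterion (its output has low loss), while the criterion says nothing about which process, if any, would find such agents.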

Still, both mechanisms and criteria come in a wide variety -- what makes a good one for a selection theorem? Making the selection theorem applicable to the real world situation we care about. The next section focuses on this topic of application, but in summary: mechanisms must fit actual selection processes in the situation, whereas criteria must come with an explanation of why they would be instantiated (possibly a corresponding selection mechanism, but not necessarily).

It’s also less obvious what makes a “good” criterion, because of the risk of assuming the constraint we want to show in the selection criterion itself.

## Environments

I find John’s formulation above unfortunate, because it doesn’t stress enough how the “broad class of environments” is part of the hypothesis of a selection theorem, not the conclusion. The intuition here is that we need enough variety to instantiate the selection pressure or criterion. Selection lets you force the agent’s hand, but only if you can instantiate the situations you need.

For a selection mechanism, this amounts to containing the sort of situations where the mechanism will push in the right direction and be strong enough (for example predation pushing natural selection forward). For a selection criterion, it is about including the situations that take advantage of every suboptimality in the agent (like the exploitative bets punishing suboptimality in no Dutch-booking).

Note though that while a broader class of environments might be necessary for proving the theorems, it makes applying them more difficult by putting more conditions on the environments in the real world setting.
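The role of exploitative bets can be illustrated with a small money-pump sketch. This is a toy under stated assumptions, not a formalization of any coherence theorem: preferences are sets of ordered pairs, an "environment" is a list of offered items, the agent pays a fixed fee whenever it accepts a trade it strictly prefers, and all names (`run_trades`, `is_dutch_booked`, `broad`, `narrow`) are hypothetical.

```python
from typing import List, Set, Tuple

# (x, y) in a Preference means "strictly prefers x to y" (toy assumption).
Preference = Set[Tuple[str, str]]


def run_trades(prefers: Preference, start: str, offers: List[str],
               fee: int = 1) -> Tuple[str, int]:
    """Offer items one by one; the agent swaps (and pays the fee)
    whenever it strictly prefers the offered item to what it holds."""
    holding, paid = start, 0
    for offered in offers:
        if (offered, holding) in prefers:
            holding, paid = offered, paid + fee
    return holding, paid


def is_dutch_booked(prefers: Preference, start: str,
                    offers: List[str]) -> bool:
    """Sure loss in this toy: same item as at the start, less money."""
    final, paid = run_trades(prefers, start, offers)
    return final == start and paid > 0


cyclic: Preference = {("A", "B"), ("B", "C"), ("C", "A")}
transitive: Preference = {("A", "B"), ("B", "C"), ("A", "C")}

broad = ["C", "B", "A"]  # contains the full exploiting sequence of offers
narrow = ["B"]           # lacks the offers needed to exploit the cycle

print(is_dutch_booked(cyclic, "A", broad))      # True
print(is_dutch_booked(transitive, "A", broad))  # False
print(is_dutch_booked(cyclic, "A", narrow))     # False
```

With the `broad` environment the cyclic agent is pumped and the transitive one is not. With the `narrow` environment the two are indistinguishable, which is the sense in which the class of environments has to be rich enough for the criterion to bite.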

There is thus a trade-off between making it possible to prove the theorem (more environments) and making it possible to apply it (fewer environments). We thus want as small a set of environments as possible while still being large enough to leverage the selection.

## Constraint on agents

In the original post, John takes pains to split agents’ type signatures into different components and to explain how they interact with each other. At the level we’re working at though, we only need to understand that type signatures are necessary conditions on the agents coming from the selection: if an agent is to be selected, it must do X (or do X with high probability).

What sort of conditions do selection theorems show? Here we have a discrepancy between what selection theorems historically prove and what John wants to get out of them. Existing selection theorems only prove behavioral necessary conditions: you must act like this (as in coherence theorems) or you must be able to do that (as in the Gooder Regulator theorem). On the other hand, what we truly want are structural necessary conditions, for example “you must have a separate world model with this interface and these properties”. John’s third post on selection theorems is all about how he wants that.

Indeed, structural constraints tell you not only that the system must solve the problem, but how it will do so. Alignment just becomes easier if we have knowledge of the internal structure of the system: we can make more pointed predictions about how it might be unaligned; we might use this structure for more concrete alignment schemes. Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.

I’ll go into more detail about proving structural constraints in the next section, but for the moment just note that this is the sort of thing we want.

## Summary

Selection theorems thus have the following structure:

- Hypotheses
  - (Selection pressure) Some means of selection, either a mechanism or a criterion.
  - (Environments) A class of environments broad enough to instantiate the selection pressure in the needed way, but small enough to still apply the theorem to real world settings.
- Conclusion
  - (Necessary Condition on Agents) Some property (ideally structural but maybe behavioral) that is guaranteed for all selected agents, or at least with high probability.

## Proving Selection Theorems

## Behavioral constraints

Existing selection theorems only prove behavioral constraints; that is, they only show that the agents must be behaviorally equivalent to a specific class (like EU maximizers in coherence theorems) or that they must be able to solve a specific problem (remembering all relevant data in the Gooder Regulator theorem).

How to prove selection theorems for behavioral constraints? Looking at the existing theorems, the first thing to notice is that they tend to use selection criteria. It makes sense, as they tend to be proved backwards: looking at the necessary condition on agents, what criterion selects only agents behaving like this?

It doesn’t mean such theorems are trivial or useless, just that they tell us which criterion selects for the necessary condition, not what is selected by some selection pressure.

## Structural constraints

Here instead of criteria, mechanisms are favored. This is mostly because we want to show that some process (natural selection and/or ML training) leads to structural constraints, not find criteria for structural constraints.

Note that we should expect any mechanism to find some good ad-hoc agent without the structural properties; selection theorems for structural constraints can thus only give probabilistic guarantees. They say “out of the agents favored by this selection mechanism, most/almost all will have these structural properties”.
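The shape of such a probabilistic guarantee can be seen in a deliberately artificial toy, sketched below. All choices here are assumptions made for illustration: agents are 12-bit strings, "selected" means having at least 9 ones, and the stand-in "structural property" is not containing three zeros in a row. The agent space is small enough to enumerate exhaustively instead of sampling.

```python
from itertools import product
from typing import Tuple

N_BITS = 12    # size of the toy agent space (assumption)
MIN_ONES = 9   # toy selection threshold (assumption)

Agent = Tuple[int, ...]


def is_selected(agent: Agent) -> bool:
    """Toy behavioral selection: enough ones in the bit-string."""
    return sum(agent) >= MIN_ONES


def has_structure(agent: Agent) -> bool:
    """Toy stand-in for a structural property: no run of three zeros."""
    return "000" not in "".join(str(bit) for bit in agent)


# Enumerate all 2**12 agents and keep the selected ones.
selected = [a for a in product((0, 1), repeat=N_BITS) if is_selected(a)]
structured = [a for a in selected if has_structure(a)]

exceptions = len(selected) - len(structured)
fraction = len(structured) / len(selected)

print(len(selected), exceptions, round(fraction, 3))  # 299 10 0.967
```

Here 289 of the 299 selected agents have the property, so "almost all selected agents are structured" holds, yet 10 ad-hoc exceptions exist. A guarantee of this kind is therefore about typical selected agents, not about every one.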

Here are some epistemic strategies to argue that the typical agents selected by a selection theorem on behavior alone should in expectation have additional structural constraints. The list isn’t exhaustive, and I expect these strategies to be combined when actually arguing for such structural constraints.

- (Agents with these structural constraints are easier to find) Especially with a selection mechanism, we can argue for properties of the selected agents that are easier to find. For example, John argues that robust and broad optima are easier to find and retain through mechanisms for selection like gradient descent or natural selection, and proposes that these optima might correspond mostly to agents with modular structures.
- (Agents with structural constraints are a majority) If we can show that most of the selected-for agents have these structural constraints, that is some evidence that we should expect that structure. Not as strong as with an explanation of why these would be favored though. Note that this applies both to mechanisms and criteria.
- (Agents with structural constraints are easier to sample) I already described this epistemic strategy, in relation to Alex Turner’s work on Convergent Subgoals and a comparison to Smoothed Analysis. Basically, if one can show that the agents without structural properties are so rare that they correspond to very steep high peaks in a mostly flat landscape, all but very few samplings of selected-for agents will end up satisfying the structural properties.
- (Proposed sampling gets structural constraints with high probability) If we can propose a sampling method for agents, argue that it indeed fits with how the selection pressure eventually samples (like the proposal here for SGD), and show that this sampling finds in expectation agents with the structural constraint, that’s a very strong argument for assuming this structure.

## Summary

Proving selection theorems uses the following epistemic strategies:

- Proving behavioral selection theorem
- Proving structural selection theorem

## Applying Selection Theorems

Even pure mathematicians don’t prove theorems only for the joy of the proof: the value of a theorem often comes from what it shows and where it can be applied. The same holds in alignment, with the additional difficulty that we want to apply it to real world systems and situations, not only to other abstractions. This means we need to understand when we can apply selection theorems and what we can learn from that application.

## Requirements of selection theorems

First things first: selection theorems require the existence of selection. Once again quite obvious, but it becomes more interesting if we dig into the subtleties.

How to argue for the existence of selection depends on whether the theorem uses a mechanism or a criterion.

- (Mechanism) The question is whether the mechanism actually happens in the real world application. Answering this question can go from trivial (we know ML training happens because we’re the ones implementing it) to calling for epistemic strategies of its own for showing that selection happens (like the arguments for natural selection).
- (Criterion) An additional difficulty with a criterion is that we need to justify that selection along this criterion indeed happens. That doesn’t necessarily mean providing a full selection mechanism, but we need at least reasons for why this would happen. Most selection theorems using criteria (like coherence theorems) propose a high-level selection mechanism for this purpose.

The other requirement lies on environments. Not only do we need the variety of environments over which selection is taking place, but environments also need to fit the mold assumed in selection theorems. Coherence theorems for example require that bets can be defined in the environments with the required properties, and that the space of bets considered contains the Dutch-booking strategies for any suboptimal policy. The Gooder Regulator Theorem has more concrete requirements in terms of the underlying causal structure, and the same sort of variety constraint on the “tasks” that the agent has to solve.

## Interpreting the application of selection theorems

Once we are confident the selection theorem applies in our concrete setting, we can reap its fruits. But what are those fruits? At first glance, they’re obvious: the necessary conditions stated in the theorem! Yet anyone who ever applied a theorem to a real world setting knows how perilous that task is.

How do you make sense of the necessary conditions in your setting for example? You need to find a way of grounding the constraints on agents you get out of the theorem.

This is where most applications of theorems to real world settings go wrong, in my opinion. Yet this is also the part I have the least to say about, because I just don’t have some nice epistemic strategy to check that some conclusions taken from applying a theorem to situation S actually make sense. I’ve seen people do that move, I’ve made it myself, but I don’t have a nice description of the underlying algorithm. So let’s flag that as an open problem for the time being.

## Summary

Applying selection theorems uses the following epistemic strategies:

- Checking that the selection theorem applies
- Interpreting the theorem after applying it
  - Open Epistemic Strategy Problem

## Breaking Selection Theorems

Last but not least, analyzing an epistemic strategy tells us where it can go wrong. The analogy to keep in mind here is falsification: a standard and strong way of trying to break a scientific model. What does that look like for selection theorems?

Let’s use the summary design patterns of the previous sections and, for each one, find issues, criticisms, and ways of breaking that step.

- Proving behavioral selection theorem
  - Find a counterexample (agent selected by criterion but not satisfying the necessary condition; or subset of enough agents to break probabilistic condition).
  - Find an error in the proof.
- Proving structural selection theorem
  - Show that many agents without the structural constraints can be easily found by the selection pressure.
  - Show that there isn’t a majority of selected-for agents with structural constraints.
  - Argue that the set of selected-for agents is different than the one used in the work, and that for the actual set, sampling agents without structural constraints becomes simpler.
  - Show that the proposed sampling disagrees with what the selection pressure actually finds (showing that the probabilities are different, or that one can sample agents that the other can’t).
- Checking that the selection theorem applies
  - Show that the actual selection works differently than the mechanism described, and that these differences influence massively what is selected in the end.
  - Argue that the posited high-level selection mechanism for the criterion doesn’t exist or that it doesn’t push towards the criterion.
  - Show that the concrete environments don’t fit the constraints of the theorem.
  - Show that the concrete environment lacks some situations that are needed to make the proof hold.
- Interpreting the theorem after applying it

Lastly, in addition to criticizing a specific application of the theorem, we might argue that the theorem cannot be applied to the wanted setting, or that it doesn’t make sense to conclude what is wanted from it. This amounts to the points above, with the twist of arguing that it’s impossible instead of just breaking the argument at some joint.
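One of the breaking strategies above, showing that a proposed sampling disagrees with what the selection pressure actually finds, has a fairly mechanical empirical form: compare the two distributions, both in probability and in support. The sketch below is only an illustration of that comparison. The agent labels and sample lists are made up, and total variation distance is just one reasonable choice of discrepancy measure.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set

Distribution = Dict[Hashable, float]


def empirical(samples: Iterable[Hashable]) -> Distribution:
    """Empirical distribution over agent 'types' from a list of samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def total_variation(p: Distribution, q: Distribution) -> float:
    """Half the L1 distance between two distributions (0 means identical)."""
    kinds = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in kinds)


def only_in_first(p: Distribution, q: Distribution) -> Set[Hashable]:
    """Agent types that the first process can produce but the second cannot."""
    return set(p) - set(q)


# Made-up data: what the proposed sampling yields vs. what selection finds.
proposed = empirical(["modular"] * 9 + ["ad_hoc"] * 1)
observed = empirical(["modular"] * 5 + ["ad_hoc"] * 4 + ["other"] * 1)

print(round(total_variation(proposed, observed), 3))  # 0.4
print(only_in_first(observed, proposed))              # {'other'}
```

A large distance corresponds to "the probabilities are different", and a non-empty `only_in_first` result corresponds to "one can sample agents that the other can’t".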

This obviously fails to list all possible ways of critiquing a selection theorem and its application. You might have noted that I didn’t say anything about interpreting the necessary condition once the theorem is applied; indeed, without understanding the epistemic strategy involved, it’s harder to get to the core.

Still, any criticism and feedback along these lines would be directly useful to the researcher (John or someone else) proposing a new selection theorem and/or applying one. My claim is that using the design pattern above helps in providing feedback, by drawing attention to the most important parts of the epistemic strategies involved.