Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Duncan Sabien has a post titled How the ‘Magic: The Gathering’ Color Wheel Explains Humanity. Without the context of that post, or other experience with the MtG color wheel, this post will probably not make sense. This post may not make sense anyway. I will use a type of analysis that is sometimes used to talk about humans (and often criticized even when used for humans), but rarely used in any technical subjects. I will abstract so far that everything will start to look like (almost) everything else. I will use wrong categories and stretch facts to make them look like they fit into my ontology.

I will describe 5 clusters of ideas in AI and AI safety, which correspond to the 5 Magic: The Gathering colors. Each color also comes with a failure mode. For each failure mode, the two opposing colors (on the opposite side of the pentagon) form a collection of tools and properties that might be useful for fighting that failure mode.

Mutation and Selection

So I want to make an AI that can accomplish some difficult task without trying to kill me (or at least without succeeding in killing me). Let's consider the toy task of designing a rocket. First, I need a good metric of what it means to be a good rocket design. Then, I need to search over the whole space of potential rocket designs, and find one that scores well according to my metric. I claim that search is made of two pieces: Mutation and Selection, Exploration and Optimization, or Babble and Prune.

Mutation and Selection are often thought of as components of the process of evolution. Genes spin off slightly modified copies over time through mutation, and selection repeatedly throws out the genes that score badly according to a fitness metric (that is itself changing over time). The result is that you find genes that are very fit for survival.

However, I claim that mutation and selection are much more general than evolution. Gradient descent is very close to (a speed up of) the following process. Take an initial point called your current best point. Sample a large number of points within an epsilon ball of the current best point. Select the best of the sampled points according to some metric. Call the selected point the new current best point, and repeat.
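The sample-and-select loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mutate_and_select`, the toy metric `toy_score`, and all parameter values are made up for this example, and an axis-aligned epsilon-box stands in for the epsilon ball.

```python
import random

def mutate_and_select(score, start, epsilon=0.1, samples=50, steps=200, seed=0):
    """Repeatedly sample points near the current best point and keep the best sample."""
    rng = random.Random(seed)
    best = list(start)
    for _ in range(steps):
        # Mutation: propose candidates in an epsilon-box around the current best point
        # (an assumption; the text speaks of an epsilon ball).
        candidates = [
            [x + rng.uniform(-epsilon, epsilon) for x in best]
            for _ in range(samples)
        ]
        # Selection: the best sampled point becomes the new current best point.
        best = max(candidates, key=score)
    return best

def toy_score(point):
    # Hypothetical metric whose maximum sits at (1, -2).
    return -((point[0] - 1.0) ** 2 + (point[1] + 2.0) ** 2)

result = mutate_and_select(toy_score, [0.0, 0.0])
```

Gradient descent reaches a similar place far more cheaply, because the gradient tells it which direction inside the ball is best without sampling.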

I am not trying to claim that machine learning is simple because it is mostly just mutation and selection. Rather, I am trying to claim that many of the complexities of machine learning can be viewed as trying to figure out how to do mutation and selection well.

Goodharting is a problem that arises when extreme optimization goes unchecked, especially when the optimization is much stronger than the process that chose the proxy that was being optimized for.
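As a rough illustration of one simple form of this, here is a toy simulation in which the proxy is the true value plus independent noise. Everything in it is an assumption made for the example: the Gaussian distributions, the function name `average_goodhart_gap`, and the candidate counts. It measures how much the proxy overstates the true value of whichever candidate the proxy ranks highest.

```python
import random

def average_goodhart_gap(num_candidates, trials=100, seed=0):
    """Average (proxy - true value) of the candidate with the highest proxy score."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        best_proxy, best_true = float("-inf"), 0.0
        for _ in range(num_candidates):
            true_value = rng.gauss(0.0, 1.0)
            # Assumed proxy: the true value plus independent measurement error.
            proxy = true_value + rng.gauss(0.0, 1.0)
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true_value
        total_gap += best_proxy - best_true
    return total_gap / trials

weak_gap = average_goodhart_gap(10)      # mild selection pressure
strong_gap = average_goodhart_gap(1000)  # much stronger selection pressure
```

In this toy setup, searching over more candidates selects more strongly for proxy error, so the gap between proxy and true value tends to grow with the strength of the search.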

Similarly, unchecked exploration can also lead to problems. This is especially true for systems that are very powerful, and can take irreversible actions that have not been sufficiently optimized. This could show up as a personal robot accidentally killing a human user when it gets confused, or as a powerful agent exploring its way into actions that destroy itself. I will refer to this problem as irreversible exploration.

Consequentialism

The process described above is how I want to find my rocket design. The problem is that this search is not stable or robust to scale. It is setting up an internal pressure for consequentialism, and if that consequentialism is realized, it might interfere with the integrity of the search.

By consequentialism, I am not talking about the moral framework. It is similar to the moral framework, but it should not have any connotations of morality. Instead I am talking about the process of reasoning about the consequences of potential actions, and choosing actions based on those consequences. Other phrases I may use to describe this process include agency, doing things on purpose, and back-chaining.

Let's go back to the evolution analogy. Many tools have evolved to perform many subgoals of survival, but one of the most influential tools to evolve was the mind. The mind was so useful because it was able to optimize on a tighter feedback cycle than evolution itself. Instead of using genes that encode different strategies for gathering food and keeping the ones that work, the mind can reason about different strategies for gathering food, try things, see which ways work, and generalize across domains, all within a single generation. The best way for evolution to gather food is to create a process that uses a feedback loop that is unavailable to evolution directly to improve the food gathering process. This process is implemented using a goal. The mind has a goal of gathering food. The outer evolution process need not learn by trial and error. It can just choose the minds, and let the minds gather the food. This is more efficient, so it wins.

It is worth noting that gathering food might only be a subgoal for evolution, and it could still be locally worthwhile to create minds that reason terminally about gathering food. In fact, reasoning about gathering food might be more efficient than reasoning about genetic fitness.

Mutation and selection together form an outer search process that finds things that score well according to some metric. Consequentialism is a generic way to score well on any given metric: choose actions on purpose that score well according to that metric (or a similar metric). It is hard to draw the line between things that score well by accident, and things that score well on purpose, so when we try to search over things that score well by accident, we find things that score well on purpose. Note that the consequentialism might itself be built out of a mutation and selection process on a different meta level, but the point is that it is searching over things to choose between using a score that represents the consequences of choosing those things. From the point of view of the outer search process, it will just look like a thing that scores well.

So a naive search trying to solve a hard problem may find things that are themselves using consequentialism. This is a problem for my rocket design task, because I was trying to be the consequentialist, and I was trying to just use the search as a tool to accomplish my goal of getting to the moon. When I make consequentialism without being very careful to ensure it is pointed in the same direction as I am, I create a conflict. This is a conflict that I might lose, and that is the problem. I will refer to consequentialism arising within a powerful search process as inner optimizers or daemons.

Boxing and Mildness

This is where AI safety comes in. Note that this is a descriptive analysis of AI safety, and not necessarily a prescriptive one. Some approaches to AI safety attempt to combat daemons and irreversible exploration through structure and restrictions. The central example in this cluster is AI boxing. We put the AI in a box, and if it starts to behave badly, we shut it off. This way, if a daemon comes out of our optimization process, it won't be able to mess up the outside world. I obviously don't put too much weight in something like that working, but boxing is a pretty good strategy for dealing with irreversible exploration. If you want to try a thing that may have bad consequences, you can spin up a sandbox inside your head that is supposed to model the real world, you can try the thing in the sandbox, and if it messes things up in your sandbox, don't try it in the real world. I think this is actually a large part of how we can learn in the real world without bad consequences. (This is actually a combination of boxing and selection together fighting against irreversible exploration.)

Other strategies I want to put in this cluster include formal verification, informed oversight and factorization. By factorization, I am talking about things like factored cognition and comprehensive AI services. In both cases, problems are broken up into small pieces by a trusted system, and the small pieces accomplish small tasks. This way, you never have to run any large untrusted evolution-like search, and don't have to worry about daemons.

The main problem with things in this cluster is that they likely won't work. However, if I imagine they worked too well, and I had a system that actually had these types of restrictions making it safe throughout, there is still a (benign) failure mode which I will refer to as lack of algorithmic range. By this, I mean things like making a system that is not Turing complete, and so can't solve some hard problems, or a prior that is not rich enough to contain the true world.

Mildness is another cluster of approaches in AI safety, which is used to combat Daemons and Goodhart. Approaches in this cluster include Mild Optimization, Impact Measures, and Corrigibility. They are all based on the fact that the world is already partially optimized for our values (or vice versa), and too much optimization can destroy that.

A central example of this is quantilization, which is a type of mild optimization. We have a proxy which was observed to be good in the prior, unoptimized distribution of possible outcomes. If we then optimize the outcome according to that proxy, we will go to a single point with a high proxy value. There is no guarantee that that point will be good according to the true value. With quantilization, we instead do something like choose a point at random, according to the unoptimized distribution from among the top one percent of possible outcomes according to the proxy. This allows us to transfer some guarantees from the unoptimized distribution to the final outcome.
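A minimal sketch of the quantilization step, assuming we can draw a finite batch of candidates from the unoptimized distribution. The function name `quantilize`, the `top_fraction` parameter, and the toy Gaussian candidates are all hypothetical choices made for this example.

```python
import random

def quantilize(candidates, proxy, top_fraction=0.01, rng=None):
    """Pick uniformly at random among the top fraction of candidates by proxy score.

    `candidates` is assumed to be a batch of samples from the unoptimized
    distribution, so a uniform choice among the top-ranked samples approximates
    sampling that distribution conditioned on a high proxy value.
    """
    rng = rng or random.Random()
    ranked = sorted(candidates, key=proxy, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:cutoff])

# Toy usage: outcomes are numbers and the proxy is the identity function.
rng = random.Random(0)
outcomes = [rng.gauss(0.0, 1.0) for _ in range(1000)]
chosen = quantilize(outcomes, proxy=lambda x: x, top_fraction=0.01, rng=rng)
full_optimum = max(outcomes)  # what unrestricted optimization of the proxy would pick
```

With `top_fraction=0.01`, any outcome can be at most 100 times more likely under the quantilizer than under the unoptimized distribution, which is the sense in which some guarantees carry over.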

Impact measures are similarly only valuable because the do-nothing action is special in that it is observed to be good for humans. Corrigibility is largely about making systems that are superintelligent without being themselves fully agentic. We want systems that are willing to let human operators fix them, in a way that doesn't result in optimizing the world for being the perfect way to collect large amounts of feedback from microscopic humans. Finally, note that one way to stop a search from creating an optimization daemon is to just not push it too hard.

The main problem with this class of solutions is a lack of competitiveness. It is easy to make a system that doesn't optimize too hard. Just make a system that doesn't do anything. The problem is that we want a system that actually does stuff, partially because it needs to keep up with other systems that are growing and doing things.

Comments
"Other strategies I want to put in this cluster include formal verification, informed oversight and factorization."

Why informed oversight? It doesn't feel like a natural fit to me. Perhaps you think any oversight fits in this category, as opposed to the specific problem pointed to by informed oversight? Or perhaps there was no better place to put it?

"Corrigibility is largely about making systems that are superintelligent without being themselves fully agentic."

This seems very different from the notion of corrigibility that is "a system that is trying to help its operator". Do you think that these are two different notions, or are they different ways of pointing at the same thing?

I think informed oversight fits better with MtG white than it does with boxing. I agree that the three main examples are boxing like, and informed oversight is not, but it still feels white to me.

I do think that corrigibility done right is a thing that is in some sense less agentic. I think that things that have goals outside of them are less agentic than things that have their goals inside of them, but I think corrigibility is stronger than that. I want to say something like a corrigible agent not only has its goals partially on the outside (in the human), but also partially has its decision theory on the outside. Idk.

"Finally, note that one way to stop a search from creating an optimization daemon is to just not push it too hard." - An "optimisation demon" doesn't have to try to optimise itself to the top. What about a "semi-optimisation demon" that tries to just get within the appropriate range?

The weirdest part about "an optimization demon" is "this is our measure of good (outcomes), but don't push too hard towards it or you'll get something bad", when intuitively something that is optimizing at our expense would have a harder time meeting stricter constraints.

The reasoning behind it is a) us and b) everything we call brains, being the result of "pushing too hard". It's not immediately clear how a "semi-optimization demon" would come to be, or what that would mean.

It's also not clear when and how you'd have the issue, aside from running a genetic algorithm for ages.