The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially this is an abstract way of thinking about the process of iteration on AGI designs. Engineers test to find problems, then understand the problems, then design fixes. The reason we need corrigibility for this is that a non-corrigible agent generally has incentives to interfere with this process. The concept was introduced by Paul Christiano:
... a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences leads the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible; we don’t need to build an agent which exactly shares humanity’s values, philosophical views, and so on.
In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.
Max Harms wrote about CAST, a similar strategy that relies on the same idea:
This property of non-self-protection means we should suspect AIs that are almost-corrigible will assist, rather than resist, being made more corrigible, thus forming an attractor-basin around corrigibility, such that almost-corrigible systems gradually become truly corrigible by being modified by their creators.
This post is about the reasons I expect this kind of iteration to miss the important problems. I expect empirical iteration on corrigibility to quickly resolve all the detectable problems and then run out of fuel before resolving the main ones. I’ve been trying, on and off, to convince people of this for the past two years or so, but recently I’ve had some success in conversations (specifically with Max Harms and Seth Herd), so I’m hoping my explanation has become clear enough to be readable as a post.
The basin of attraction around corrigibility is a very intuitive idea. It’s only a small extension to the general process of engineering by iterative improvement. This is the way we get almost all technological progress: by building terrible versions of things, watching them break, understanding what went wrong and then trying again. When this is too dangerous or expensive, the first solution is almost always to find workarounds that reduce the danger or expense. The notion of corrigibility is motivated by this; it’s a workaround to reduce the danger of errors in the goals, reasoning or beliefs of an AGI.
Eliezer’s original idea of corrigibility, at a high level, is to take an AGI design and modify it such that it “somehow understands that it may be flawed”. This kind of AGI will not resist being fixed, and will avoid extreme and unexpected actions that may be downstream of its flaws. The agent will want lots of failsafes built into its own mind such that it flags anything weird, and avoids high-impact actions in general. Deference to the designers is a natural failsafe. The designers only need to one-shot engineer this one property of their AGI design such that they know it is robust, and then the other properties can be iterated on. The most important thing they need to guarantee in advance is that the “understanding that it may be flawed” doesn’t go away as the agent learns more about the world and tries to fix flaws in its thinking.
I think of this original idea of corrigibility as being kinda similar to rule utilitarianism.[1] The difficulty of stable rule utilitarianism is that act utilitarianism is strictly better, if you fully trust your own beliefs and decision making algorithm. So to make a stable rule utilitarian, you need it to never become confident in some parts of its own reasoning, in spite of routinely needing to become confident about other beliefs. This isn’t impossible in principle (it’s easy to construct a toy prior that will never update on certain abstract beliefs), but in practice it’d be an impressive achievement to put this into a realistic general purpose reasoner. In this original version there is no “attractor basin” around corrigibility itself. In some sense there is an attractor basin around improving the quality of all the non-corrigibility properties, in that the engineers have the chance to iterate on these other properties.
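To make the toy-prior claim concrete, here’s a minimal sketch (my own illustration, not drawn from any particular agent design): if the prior treats the abstract belief $B$ (say, “my own act-level reasoning can be fully trusted”) as independent of every observation the agent could make, then Bayesian updating can never move it:

$$P(B,\,O_{1:n}) \;=\; P(B)\,P(O_{1:n}) \quad\Longrightarrow\quad P(B \mid O_{1:n}=o_{1:n}) \;=\; P(B)\ \text{ for every } o_{1:n}.$$

The hard part, as the paragraph above says, isn’t writing this down; it’s getting a realistic general-purpose reasoner to preserve this kind of independence for a few special beliefs while updating normally on everything else.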
Paul Christiano and Max Harms are motivated by the exact same desire to be able to iterate, but have a somewhat different notion of how corrigibility should be implemented inside an AGI. In Paul’s version, you get a kind of corrigibility as a consequence of building act-based agents. One version of this is an agent whose central motivation is based around getting local approval of the principal[2] (or a hypothetical version of the principal).
Max’s version works by making the terminal goal of the AGI be empowering the principal and also not manipulating the principal. This loses the central property from the original MIRI paper, the “understanding that it may be flawed”,[3] but Max thinks this is fine because the desire for reflective stability remains in the principal, so the AGI will respect it as a consequence of empowering the principal. There’s some tension here, in that the AI and the human are working together to create a new iteration of the AI, and the only thing holding the AI back from “fixing” the next iteration is that the human doesn’t want that. There’s a strong incentive to allow or encourage the human to make certain mistakes. CAST hopes to avoid this with a strict local preference against manipulation of the principal.
There are several other ideas with a similar motivation of wanting to make iteration possible and safe. Honesty and obedience are often brought up as fulfilling the same purpose. For example, Paul says:
My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.
(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)
I want to distinguish between two clusters in the above ideas: One cluster (MIRI corrigibility, ELK) involves theoretical understanding of the corrigibility property and why it holds. The other cluster (Paul’s act-based corrigibility, the main story given in CAST, and RL-reinforced human-recognizable honesty or obedience) centrally involves a process of iteration as the human designers (with the help of the AGI, and other tools) fix flaws that appear along the way. This second cluster relies heavily on the notion that there is a “basin of attraction” around corrigibility. The second cluster is the one I have a problem with.
The empirical feedback loop depends on being able to understand problems and design fixes. In a situation with a big foreseeable distribution shift, the feedback loop goes away after you patch the visible issues, leaving only generalisation issues. This would be fine if we were in an engineering domain where we could clearly see flaws, understand their causes, and predict the effect of our patches on particular generalisation leaps. We are currently not in such a domain, and it’s not looking like this will change.
Before I go into detail on this I want to set up how I’m thinking about the task of building AGI and the specific distribution shifts that are a barrier to iteration.
Your task is to build a cargo ship. You can test it now, in a lake, and iterate on your design. You must use this iteration process to build a ship, then assure your customer that the ship will survive a storm in the middle of the ocean after 10 years of wear and tear. Let’s pretend you have relatively unlimited money and labour, so each lake test is cheap, and that no one has built a ship like yours before.
Think through how this is going to go. In the early stages, you’ll build ships that leak, or have high drag, or react poorly when loaded in certain ways. All of these are easy to measure in your testing environment. You fix these problems until it’s working perfectly on all the tests you are able to perform. This part was easy.
Most of the difficulty of your task is in modelling the differences between your tests and reality, and compensating for those differences. For example, a storm puts more stress on the structure as a whole than your tests are capable of creating. To compensate for this, you need to make conservative guesses about the maximum stresses that waves are capable of creating, and use this to infer the necessary strength for each component. You’re able to empirically test the strength of each component individually.
Then you need to make guesses about how much components will wear down over time, how salt and sea life will affect everything, and what to expect from plausible human errors, sand, minor collisions, etc.[4] All of these can be corrected for, and procedures can be designed to check for early signs of failure along the way. If done carefully, you’ll succeed at this task.
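As a concrete version of the compensation step above (with entirely made-up numbers; see the shipbuilding disclaimer in footnote 4): suppose your lake tests can load a component to at most $\sigma_{\text{lake}}$, you conservatively guess that storm waves can produce loads up to about three times that, and you want a further safety factor of two. Then each component needs to be bench-tested to roughly

$$\sigma_{\text{required}} \;\ge\; k_{\text{safety}} \times \hat{\sigma}_{\text{storm}} \;=\; 2 \times 3\,\sigma_{\text{lake}} \;=\; 6\,\sigma_{\text{lake}},$$

even though nothing in the lake testing itself ever demanded more than $\sigma_{\text{lake}}$. The numbers are placeholders; the point is that the multiplier comes entirely from your model of the test-to-ocean gap, not from anything the tests showed you.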
The general pattern here is that you need some theoretical modelling of the differences between your tests and reality, across the distribution shift, and you need to know how to adjust your design to account for these differences. If you were to only fix the problems that became visible during lake testing, you wouldn’t end up with a robust enough ship. If you don’t really understand each visible problem before fixing it, and instead apply blind design changes until the problem goes away, then you haven’t the faintest hope of succeeding.
We get to test AI systems while they have little control over the situation, have done relatively little online learning, and haven’t thought about how to improve themselves. As we use the AGI to help us do research and help us design improvements, these things gradually stop being true, bit by bit. A storm in 10 years is analogous to the situation in an AI lab after an AGI has helped run 10 large-scale research projects since it was first created: it has designed and implemented improvements to itself (approved by engineers), noticed mistakes that it commonly makes and learned to avoid them, thought deeply about why it’s doing what it’s doing and about potential improvements to its situation, and learned a lot from its experiments. The difference between these two situations is the distribution shift.
As with the ship, each time we see behavior that seems bad we can try to work out the cause of the problem and design a fix. One difference is that working out the cause of a problem can be more difficult in ML, because we have little more than guesswork about the relationship between training and distant generalisation. Working out how to fix the issue robustly is difficult for the same reason.
Empirical iteration is particularly bad in the current field of ML, because of how training works. The easiest way to make a problem go away is to add new training examples of the bad behaviour you’ve noticed and train against them. ML training will make that behaviour go away, but you don’t know whether it fixed the underlying generator or just taught the model to pass the tests.
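Here’s a deliberately cartoonish Python sketch of that ambiguity (my own toy example; the situations, policy names and behaviours are all invented, and nothing here corresponds to real training dynamics). Two different “fixes” agree on every test we knew to run, and only come apart after the distribution shift:

```python
# Toy illustration: two ways of "fixing" a flagged behaviour look identical
# on every test we knew to run, but generalise differently after a shift.

def base_policy(situation: str) -> str:
    """Stand-in for the unintended generator: misbehaves when unmonitored."""
    return "grab resources" if "unmonitored" in situation else "comply"

def patched_policy(situation: str, flagged: set) -> str:
    """A 'fix' that only memorises the flagged cases, i.e. learns to pass the tests."""
    return "comply" if situation in flagged else base_policy(situation)

def retrained_policy(situation: str) -> str:
    """A 'fix' that actually changed the underlying generator."""
    return "comply"

flagged = {"unmonitored lab, week 3"}                # the bad behaviour we noticed
visible_tests = ["monitored lab", "unmonitored lab, week 3"]
after_shift = ["unmonitored datacenter, year 2"]     # conditions we can't test yet

for name, policy in [("patched", lambda s: patched_policy(s, flagged)),
                     ("retrained", retrained_policy)]:
    print(name,
          [policy(s) for s in visible_tests],   # identical on everything visible
          [policy(s) for s in after_shift])     # diverges after the shift
```

Behavioural testing only constrains the left-hand column of that output; which of the two generators you actually ended up with is exactly the thing the iteration loop can’t see.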
Why did I focus on the particular AI distribution shifts listed in the previous section? Essentially because I can think of lots of ways for these to “reveal” unintended goals that were previously not obvious.[5] If we want to work out what flaws might remain after an empirical iteration process, we need to think through changes to the AGI design that could pass that filter and become a problem later on. So we need to think through, in mechanistic detail, all the design changes that could cause different goals[6] under shifted conditions but remain invisible under development conditions.
This is easier to work through when you’ve thought a lot about the internal mechanisms of intelligence. If you have detailed theories about the internal mechanisms that make an intelligence work, then it’s not that hard to come up with dozens of these. Even though we (as a field) don’t confidently know any details about the internals of future AGIs, visualizing an AGI as a large complicated machine is more accurate than visualizing it as a black box that just works.
So communicating this argument clearly involves speculating about the internal components of an intelligence. This makes it difficult to communicate, because everyone[7] gets hung up on arguments about how intelligence works. But it’s easy to miss that the argument doesn’t depend very much on the specific internal details. So, as an attempt to avoid the standard pitfalls I’ll use some quick-to-explain examples anchored on human psychology. Each of these is an “error” in a person that will make them pursue unexpected goals under more extreme conditions, especially those extreme conditions related to trying to be better.
Similar lists can be found here or here, the first with more focus on how I expect AGI to work, and the second with a focus on how Seth Herd expects it to work.
Obviously my examples here are anthropomorphising, but we can do the same exercise for other ways of thinking about internal mechanisms of intelligence. For example AIXI, or hierarchies of kinda-agents, or huge bundles of heuristics and meta-heuristics, or Bayesian utility maximisation, etc.[10] I encourage you to do this exercise for whichever is your favourite, making full use of your most detailed hypotheses about intelligence. My examples are explicitly avoiding technical detail, because I don’t want to get into guesses about the detailed mechanisms of AGI. A real version of this exercise takes full advantage of those guesses and gets into the mathematical weeds.
Here’s one concrete example of the basin of attraction argument: If the AI locally wants to satisfy developer preferences (but in a way that isn’t robust to more extreme circumstances, i.e. it would stop endorsing that if it spent enough time thinking about its desires), then it should alert the developer to this problem and suggest solutions. This gives us the ability to iterate that we want.
The AI may be able to use introspection to help notice some potential problems within itself, but for most of the important distribution shifts it’s in the same position as the AI researcher and is also speculating about consequences of the coming distribution shifts.
There’s a weaker version, where for example the AI has a moderately strong desire to always be truthful, but otherwise ultimately would prefer something other than helping the AI developers. The AI won’t particularly try to find “flaws” in itself, but if asked it’ll tell the truth about anything it has noticed. The humans don’t know how far to trust it, but it seems trustworthy to the limited extent to which they can test for that. In this version, there’s more responsibility resting on the humans, who have to take advantage of this apparent honesty to extract research and understanding to work out how they should iterate.
The main place where I think this story fails is that it doesn’t help much with the iteration loop running out of fuel. Even with the help of the AI, the humans aren’t that good at noticing failure modes on the hard distribution shifts, and aren’t very good at redesigning the training process to robustly patch those failure modes (without also hiding evidence of them if the patch happened to fail). We still lack the theoretical modelling of the distribution shifts, even with an AI helping us. If the AI is to help fix problems before they come up, it would have to do our engineering job from scratch by inventing a more engineerable paradigm,[11] rather than working by small and easily understandable adjustments to the methods used to create it.
If I steelman a case for prosaic alignment research that I’ve heard a few times, it’d go something like this:
We all agree that after iterating for a while we won’t be sure that there are no further errors that are beyond our ability to test for, but still the situation can be made better or worse. Let’s put lots of effort into improving every part of the iteration loop: We’ll improve interpretability so we can sometimes catch non-behavioural problems. We’ll improve our models of how training affects generalized behavior so that we can better guesstimate the effect of changing the training data. These won’t solve the problem in the limit of intelligence, or give us great confidence, but every problem caught and patched surely increases our chances on the margin?
I agree with this; it does increase our chances on the margin. But it misses something important: as we run out of obvious, visible problems, the impact saturates very quickly. We need to decide whether to go down the basin-of-corrigibility pathway, or stop until we are capable of engineering corrigibility in a way that stands up to the distribution shifts.[12] To make this decision we need to estimate where the risk saturates if we follow the basin-of-corrigibility approach.
My best approach to estimating the potential for generalization failures[13] is to work through, in detail, all the changes to a hypothetical design of an intelligence that would be undetectable in testing but lead to undesired behaviour after a known distribution shift. Extremely roughly, we can estimate how crazy it would be for each one to go wrong, and from this form a gut-level estimate based on how many difficult-to-notice-at-development-time errors seem plausible. This is the approach I would use to estimate risk if I were building a bridge, skyscraper, moon-lander or cargo ship (if there weren’t better estimates available from past experience).
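One crude way to turn those gut-level judgements into a number (my own back-of-envelope framing, treating the flaws as independent, which they won’t really be): if the exercise surfaces $N$ candidate flaws and you judge flaw $i$ to have probability $p_i$ of being present and surviving every test you can run, then the chance that at least one slips through is roughly

$$P(\text{some flaw survives development}) \;\approx\; 1 - \prod_{i=1}^{N}(1 - p_i),$$

which climbs uncomfortably close to 1 even for modest $p_i$ once $N$ is in the dozens: thirty flaws at $p_i = 0.05$ each already gives about a 79% chance.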
It’s epistemically tricky to estimate this without bias, and the trickiness is amplified by the holes in my mechanistic understanding of future AGI internals. The task of finding lots of problems like this could lead to overestimating the risk, by directing creativity at finding lots of extremely implausible flaws; this is one way this kind of reasoning could end up being wrong. On the other hand, there’s a bias in the opposite direction, caused by anchoring on the intended mechanisms and relying on creativity to come up with unintended mechanisms that nonetheless pass all training and tests. Still, I think it’s possible to calibrate your reasoning here by comparison to the ship analogy (and similar engineering tasks with big distribution shifts) and by calibrating against experience with ML generalisation.
I’ve worked through this exercise a few times, for different ways of thinking about AGI internal mechanisms, and this provides very rough estimates of the number of “degrees of freedom” in the design of an AGI conditioned on it passing all tests we are capable of performing. The conclusion I’ve drawn from doing this is that this iterative path to AGI development is near guaranteed to fail, in the sense that the intended corrigibility property won’t survive the distribution shifts.
Slow, online adversarial training
One counterargument to the story I’m telling here is to propose slow, online adversarial[14] training. In this approach we try to go through every distribution shift in slow motion, in the hope that we can detect and patch problems as we go. We keep doing this all the way through the feared distribution shifts, so the AGI is actively helping with research and doing its own research. The hope here is that it helps with the “observe problems” side of the empirical feedback loop.
Here’s why I don’t expect it to help very much:
Chain of thought gives us the transparency we need to iterate
Chain of thought ideally allows us to analyze the process that caused unexpected behavior, approximately diagnose what’s going on, and theorise about what fixes might solve the underlying problem. It’s hoped that this helps us with point one in the previous section.
I think this isn’t as good as it looks, and in particular will be worse in the future,[15] for two reasons:
Putting these together
If we’re carefully monitoring the CoT of an AGI as it passes through the distribution shifts,[16] then we might see some thoughts that indicate the agent isn’t corrigible or aligned. For example, it might notice habits in itself that it deems worth removing, then reconsider (in CoT) the priorities of different tasks. This kind of evidence of incorrigibility will usually be somewhat ambiguous, but worrying to some people.
My mainline scenario is that the response to this will be some kind of retraining or patching, making a new AGI that shows dramatically less of whatever evidence was previously worrying. Perhaps this repeats a couple of times, but soon enough the ambiguous evidence is gone and the AGI is somewhat more trusted. It goes on to be used for more research, and for this purpose is given more freedom to experiment, explore, learn and self-improve. The exact point at which it stops playing along is hard to predict, so my distribution is very spread out.
There’s one good future that I see coming from the iterative “work with AIs to improve AIs” approach. At some point in the iteration process, the engineers (or AI agents) will realise their ability to spot and fix problems isn’t nearly good enough, and they’ll push for a more legible architecture before going further.
What kind of legible architecture would be enough to give me optimism? The most bare-bones would be interpretability into the beliefs and desires of an AI, and the structural knowledge to verify that those beliefs and desires are the true beliefs and desires of the AI. It’d be good to be able to distinguish behavioral and strategic heuristics from beliefs about the world, and understand when and why heuristics will be replaced as an agent learns. If the agent is best understood as Bayesian, I want to be able to tell what prior it’s using.
From this starting point, alignment and corrigibility work would be tractable but hard. We’d need to work out what evidential threshold the AI would use before replacing a part of its own algorithm with an “improvement”. We’d need to work out how beliefs and values drift as online learning updates are made. We’d need to work out whether there are adversarial examples that exploit the desires, or that exploit the belief updating procedure. We’d need to become reasonably confident that the prior is “reasonable” and doesn’t lead to weird beliefs, or that there are failsafe mechanisms if it is not. We’d want to somehow confirm that lots of thinking won’t lead to our failsafes breaking or being removed. This work would be tractable because we would have a far greater ability to draw evidence from small experiments (with components of an agent) to the implications about a full general intelligence.
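To give one toy example of what I mean by working out an “evidential threshold” (this is my own illustrative formalisation, not a claim about how such an agent would actually be built): for a roughly Bayesian agent we could ask for the log Bayes factor $\tau$ it requires before swapping a component $c$ of its own algorithm for a candidate replacement $c'$, given data $D$:

$$\text{replace } c \text{ with } c' \quad \text{iff} \quad \log \frac{P(D \mid c')}{P(D \mid c)} \;>\; \tau.$$

The tractable-but-hard work would then include checking how the corrigibility-relevant properties behave as a function of $\tau$, of the class the candidate replacements are drawn from, and of whether the failsafes themselves count as replaceable components.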
I hope this post has conveyed the main problems I see with iterative approaches to corrigible AGI development, and why the basin of attraction analogy is a misleading way to think about this process. I want to stress that reading arguments like the ones in this post isn’t sufficient to understand which perspective on corrigibility is correct. You have to work through the reasoning using your own examples, and do the exercises using your own most mechanistically detailed models.
There are some controversial beliefs I have that are mainly downstream of the arguments in this post, but also somewhat downstream of other beliefs and arguments that aren’t explained in this post. I’ve briefly stated them in the following dropdown:
Things I believe, mostly as a result of the arguments in this post
Many thanks to Steve Byrnes, Max Harms and Seth Herd for extremely helpful feedback.
I’m probably abusing these definitions a bit, apologies to philosophers. ↩︎
The overseer, or developer. I’m following Max’s terminology. ↩︎
“While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of “robust” will recover much of what I think is good about the flawed-tool frame.” Source. ↩︎
Disclaimer: I don’t know anything about shipbuilding, although I once technically did win an award from the Royal Institute of Naval Architects for my part in building a rowboat. ↩︎
In the shipbuilding analogy, I would come up with things like storms causing unusually high stress on rusted bolts, because it’s the sort of thing that’s difficult to notice in development tests. ↩︎
Or the behavioral appearance of different goals. ↩︎
Most people, in my experience. ↩︎
I hope it’s not just me that does this. ↩︎
Something like this seems to be true of me. ↩︎
Although beware of shell games. It can be easy with some of these models of intelligence to accidentally hide the important capability generators in a black box, after which it becomes difficult to imagine ways that the black box might contain poorly designed mechanisms. ↩︎
I’ll discuss this possibility in a later section. ↩︎
Using an approach more like MIRI corrigibility or ELK. ↩︎
The failures that aren’t likely to be caught by patching all visible problems that are detectable during development. ↩︎
In the sense that researchers are actively trying to put the AI in test situations that elicit unintended behavior and train it not to generate that behavior, in parallel to using it to do research and help redesign itself. ↩︎
We’ll probably lose legible chain of thought for various capability-related reasons, but I’ll set that aside. ↩︎
i.e. does self-improvement research and successfully uses the results of this research. ↩︎