The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially this is an abstract way of thinking about the process of iteration on AGI designs. Engineers test to find problems, then understand the problems, then design fixes. The reason we need corrigibility for this is that a non-corrigible agent generally has incentives to interfere with this process. The concept was introduced by Paul Christiano:
... a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences leads the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible; we don’t need to build an agent which exactly shares humanity’s values, philosophical views, and so on.
In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.
Max Harms wrote about CAST, a similar strategy that relies on the same idea:
This property of non-self-protection means we should suspect AIs that are almost-corrigible will assist, rather than resist, being made more corrigible, thus forming an attractor-basin around corrigibility, such that almost-corrigible systems gradually become truly corrigible by being modified by their creators.
This post is about the reasons I expect this kind of iteration to miss the important problems. I expect empirical iteration on corrigibility to quickly resolve all the detectable problems and then run out of fuel before resolving the main ones. I’ve been trying, on and off, to convince people of this for the past two years or so, but recently I’ve had some success in conversations (specifically with Max Harms and Seth Herd), so I’m hoping my explanation has become clear enough to be readable as a post.
The basin of attraction around corrigibility is a very intuitive idea. It’s only a small extension to the general process of engineering by iterative improvement. This is the way we get almost all technological progress: by building terrible versions of things, watching them break, understanding what went wrong and then trying again. When this is too dangerous or expensive, the first solution is almost always to find workarounds that reduce the danger or expense. The notion of corrigibility is motivated by this; it’s a workaround to reduce the danger of errors in the goals, reasoning or beliefs of an AGI.
Eliezer’s original idea of corrigibility, at a high level, is to take an AGI design and modify it such that it “somehow understands that it may be flawed”. This kind of AGI will not resist being fixed, and will avoid extreme and unexpected actions that may be downstream of its flaws. The agent will want lots of failsafes built into its own mind such that it flags anything weird, and avoids high-impact actions in general. Deference to the designers is a natural failsafe. The designers only need to one-shot engineer this one property of their AGI design such that they know it is robust, and then the other properties can be iterated on. The most important thing they need to guarantee in advance is that the “understanding that it may be flawed” doesn’t go away as the agent learns more about the world and tries to fix flaws in its thinking.
I think of this original idea of corrigibility as being kinda similar to rule utilitarianism.[1] The difficulty of stable rule utilitarianism is that act utilitarianism is strictly better, if you fully trust your own beliefs and decision making algorithm. So to make a stable rule utilitarian, you need it to never become confident in some parts of its own reasoning, in spite of routinely needing to become confident about other beliefs. This isn’t impossible in principle (it’s easy to construct a toy prior that will never update on certain abstract beliefs), but in practice it’d be an impressive achievement to put this into a realistic general purpose reasoner. In this original version there is no “attractor basin” around corrigibility itself. In some sense there is an attractor basin around improving the quality of all the non-corrigibility properties, in that the engineers have the chance to iterate on these other properties.
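To make the toy-prior claim concrete, here’s a minimal sketch (my own illustration, not drawn from any particular agent design): if the prior treats the abstract belief $B$ (say, “my own act-level reasoning can be fully trusted”) as independent of every observation the agent could make, then Bayesian updating can never move it:

$$P(B,\,O_{1:n}) \;=\; P(B)\,P(O_{1:n}) \quad\Longrightarrow\quad P(B \mid O_{1:n}=o_{1:n}) \;=\; P(B)\ \text{ for every } o_{1:n}.$$

The hard part, as the paragraph above says, isn’t writing this down; it’s getting a realistic general-purpose reasoner to preserve this kind of independence for a few special beliefs while updating normally on everything else.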
Paul Christiano and Max Harms are motivated by the exact same desire to be able to iterate, but have a somewhat different notion of how corrigibility should be implemented inside an AGI. In Paul’s version, you get a kind of corrigibility as a consequence of building act-based agents. One version of this is an agent whose central motivation is based around getting local approval of the principal[2] (or a hypothetical version of the principal).
Max’s version works by making the terminal goal of the AGI be empowering the principal and also not manipulating the principal. This loses the central property from the original MIRI paper, the “understanding that it may be flawed”,[3] but Max thinks this is fine because the desire for reflective stability remains in the principal, so the AGI will respect it as a consequence of empowering the principal. There’s some tension here, in that the AI and the human are working together to create a new iteration of the AI, and the only thing holding the AI back from “fixing” the next iteration is that the human doesn’t want that. There’s a strong incentive to allow or encourage the human to make certain mistakes. CAST hopes to avoid this with a strict local preference against manipulation of the principal.
There are several other ideas with a similar motivation of wanting to make iteration possible and safe. Honesty and obedience are often brought up as fulfilling the same purpose. For example, Paul says:
My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.
(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)
I want to distinguish between two clusters in the above ideas: One cluster (MIRI corrigibility, ELK) involves theoretical understanding of the corrigibility property and why it holds. The other cluster (Paul’s act-based corrigibility, the main story given in CAST, and RL-reinforced human-recognizable honesty or obedience) centrally involves a process of iteration as the human designers (with the help of the AGI, and other tools) fix flaws that appear along the way. This second cluster relies heavily on the notion that there is a “basin of attraction” around corrigibility. The second cluster is the one I have a problem with.
The empirical feedback loop depends on being able to understand problems and design fixes. In a situation with a big foreseeable distribution shift, the feedback loop goes away after you patch the visible issues, leaving only generalisation issues. This would be fine if we were in an engineering domain where we could clearly see flaws, understand their causes, and predict the effect of our patches on particular generalisation leaps. We are currently not in such a domain, and it’s not looking like this will change.
Before I go into detail on this I want to set up how I’m thinking about the task of building AGI and the specific distribution shifts that are a barrier to iteration.
Your task is to build a cargo ship. You can test it now, in a lake, and iterate on your design. You must use this iteration process to build a ship, then assure your customer that the ship will survive a storm in the middle of the ocean after 10 years of wear and tear. Let’s pretend you have relatively unlimited money and labour, so each lake test is cheap, and that no one has built a ship like yours before.
Think through how this is going to go. In the early stages, you’ll build ships that leak, or have high drag, or react poorly when loaded in certain ways. All of these are easy to measure in your testing environment. You fix these problems until it’s working perfectly on all the tests you are able to perform. This part was easy.
Most of the difficulty of your task is in modelling the differences between your tests and reality, and compensating for those differences. For example, a storm puts more stress on the structure as a whole than your tests are capable of creating. To compensate for this, you need to make conservative guesses about the maximum stresses that waves are capable of creating, and use this to infer the necessary strength for each component. You’re able to empirically test the strength of each component individually.
Then you need to make guesses about how much components will wear down over time, how salt and sea life will affect everything, and what to expect from plausible human errors, sand, minor collisions, etc.[4] All of these can be corrected for, and procedures can be designed to check for early signs of failure along the way. If done carefully, you’ll succeed at this task.
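As a concrete version of the compensation step above (with entirely made-up numbers; see the shipbuilding disclaimer in footnote 4): suppose your lake tests can load a component to at most $\sigma_{\text{lake}}$, you conservatively guess that storm waves can produce loads up to about three times that, and you want a further safety factor of two. Then each component needs to be bench-tested to roughly

$$\sigma_{\text{required}} \;\ge\; k_{\text{safety}} \times \hat{\sigma}_{\text{storm}} \;=\; 2 \times 3\,\sigma_{\text{lake}} \;=\; 6\,\sigma_{\text{lake}},$$

even though nothing in the lake testing itself ever demanded more than $\sigma_{\text{lake}}$. The numbers are placeholders; the point is that the multiplier comes entirely from your model of the test-to-ocean gap, not from anything the tests showed you.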
The general pattern here is that you need some theoretical modelling of the differences between your tests and reality, across the distribution shift, and you need to know how to adjust your design to account for these differences. If you were to only fix the problems that became visible during lake testing, you wouldn’t end up with a robust enough ship. If you don’t really understand each visible problem before fixing it, and instead apply blind design changes until the problem goes away, then you haven’t the faintest hope of succeeding.
We get to test AI systems while they have little control over the situation, have done relatively little online learning, and haven’t thought about how to improve themselves. As we use the AGI to help us do research and help us design improvements, these things gradually stop being true, bit by bit. A storm in 10 years is analogous to the situation in an AI lab after an AGI has helped run 10 large-scale research projects since it was first created: it has designed and implemented improvements to itself (approved by engineers), noticed mistakes that it commonly makes and learned to avoid them, thought deeply about why it’s doing what it’s doing and about potential improvements to its situation, and learned a lot from its experiments. The difference between these two situations is the distribution shift.
As with the ship, each time we see behavior that seems bad we can try to work out the cause of the problem and design a fix. One difference is that working out the cause of a problem can be more difficult in ML, because we have little more than guesswork about the relationship between training and distant generalisation. Working out how to fix the issue robustly is difficult for the same reason.
Empirical iteration is particularly bad in the current field of ML, because of how training works. The easiest way to make a problem go away is to add new training examples of the bad behaviour you’ve noticed and train against them. ML training will make that behaviour go away, but you don’t know whether it fixed the underlying generator or just taught the model to pass the tests.
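Here’s a deliberately cartoonish Python sketch of that ambiguity (my own toy example; the situations, policy names and behaviours are all invented, and nothing here corresponds to real training dynamics). Two different “fixes” agree on every test we knew to run, and only come apart after the distribution shift:

```python
# Toy illustration: two ways of "fixing" a flagged behaviour look identical
# on every test we knew to run, but generalise differently after a shift.

def base_policy(situation: str) -> str:
    """Stand-in for the unintended generator: misbehaves when unmonitored."""
    return "grab resources" if "unmonitored" in situation else "comply"

def patched_policy(situation: str, flagged: set) -> str:
    """A 'fix' that only memorises the flagged cases, i.e. learns to pass the tests."""
    return "comply" if situation in flagged else base_policy(situation)

def retrained_policy(situation: str) -> str:
    """A 'fix' that actually changed the underlying generator."""
    return "comply"

flagged = {"unmonitored lab, week 3"}                # the bad behaviour we noticed
visible_tests = ["monitored lab", "unmonitored lab, week 3"]
after_shift = ["unmonitored datacenter, year 2"]     # conditions we can't test yet

for name, policy in [("patched", lambda s: patched_policy(s, flagged)),
                     ("retrained", retrained_policy)]:
    print(name,
          [policy(s) for s in visible_tests],   # identical on everything visible
          [policy(s) for s in after_shift])     # diverges after the shift
```

Behavioural testing only constrains the left-hand column of that output; which of the two generators you actually ended up with is exactly the thing the iteration loop can’t see.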
Why did I focus on the particular AI distribution shifts listed in the previous section? Essentially because I can think of lots of ways for these to “reveal” unintended goals that were previously not obvious.[5] If we want to work out what flaws might remain after an empirical iteration process, we need to think through changes to the AGI design that could pass that filter and become a problem later on. So we need to think through, in mechanistic detail, all the design changes that could cause different goals[6] under shifted conditions but remain invisible under development conditions.
This is easier to work through when you’ve thought a lot about the internal mechanisms of intelligence. If you have detailed theories about the internal mechanisms that make an intelligence work, then it’s not that hard to come up with dozens of these. Even though we (as a field) don’t confidently know any details about the internals of future AGIs, visualizing an AGI as a large complicated machine is more accurate than visualizing it as a black box that just works.
So communicating this argument clearly involves speculating about the internal components of an intelligence. This makes it difficult to communicate, because everyone[7] gets hung up on arguments about how intelligence works. But it’s easy to miss that the argument doesn’t depend very much on the specific internal details. So, as an attempt to avoid the standard pitfalls I’ll use some quick-to-explain examples anchored on human psychology. Each of these is an “error” in a person that will make them pursue unexpected goals under more extreme conditions, especially those extreme conditions related to trying to be better.
Similar lists can be found here or here, the first with more focus on how I expect AGI to work, and the second with a focus on how Seth Herd expects it to work.
Obviously my examples here are anthropomorphising, but we can do the same exercise for other ways of thinking about internal mechanisms of intelligence. For example AIXI, or hierarchies of kinda-agents, or huge bundles of heuristics and meta-heuristics, or Bayesian utility maximisation, etc.[10] I encourage you to do this exercise for whichever is your favourite, making full use of your most detailed hypotheses about intelligence. My examples are explicitly avoiding technical detail, because I don’t want to get into guesses about the detailed mechanisms of AGI. A real version of this exercise takes full advantage of those guesses and gets into the mathematical weeds.
Here’s one concrete example of the basin of attraction argument: If the AI locally wants to satisfy developer preferences (but in a way that isn’t robust to more extreme circumstances, i.e. it would stop endorsing that if it spent enough time thinking about its desires), then it should alert the developer to this problem and suggest solutions. This gives us the ability to iterate that we want.
The AI may be able to use introspection to help notice some potential problems within itself, but for most of the important distribution shifts it’s in the same position as the AI researcher and is also speculating about consequences of the coming distribution shifts.
There’s a weaker version, where for example the AI has a moderately strong desire to always be truthful, but otherwise ultimately would prefer something other than helping the AI developers. The AI won’t particularly try to find “flaws” in itself, but if asked it’ll tell the truth about anything it has noticed. The humans don’t know how far to trust it, but it seems trustworthy to the limited extent to which they can test for that. In this version, there’s more responsibility resting on the humans, who have to take advantage of this apparent honesty to extract research and understanding to work out how they should iterate.
The main place where I think this story fails is that it doesn’t help much with the iteration loop running out of fuel. Even with the help of the AI, the humans aren’t that good at noticing failure modes on the hard distribution shifts, and aren’t very good at redesigning the training process to robustly patch those failure modes (without also hiding evidence of them if the patch happened to fail). We still lack the theoretical modelling of the distribution shifts, even with an AI helping us. If the AI is to help fix problems before they come up, it would have to do our engineering job from scratch by inventing a more engineerable paradigm,[11] rather than working by small and easily understandable adjustments to the methods used to create it.
If I steelman a case for prosaic alignment research that I’ve heard a few times, it’d go something like this:
We all agree that after iterating for a while we won’t be sure that there are no further errors that are beyond our ability to test for, but still the situation can be made better or worse. Let’s put lots of effort into improving every part of the iteration loop: We’ll improve interpretability so we can sometimes catch non-behavioural problems. We’ll improve our models of how training affects generalized behavior so that we can better guesstimate the effect of changing the training data. These won’t solve the problem in the limit of intelligence, or give us great confidence, but every problem caught and patched surely increases our chances on the margin?
I agree with this; it does increase our chances on the margin. But it misses something important: as we run out of obvious, visible problems, the impact saturates very quickly. We need to decide whether to go down the basin-of-corrigibility pathway, or stop until we are capable of engineering corrigibility in a way that stands up to the distribution shifts.[12] To make this decision we need to estimate where the risk saturates if we follow the basin-of-corrigibility approach.
My best approach to estimating the potential for generalization failures[13] is to work through, in detail, all the changes to a hypothetical design of an intelligence that would be undetectable in testing but lead to undesired behaviour after a known distribution shift. Extremely roughly, we can estimate how crazy it would be for each one to go wrong, and from this form a gut-level estimate based on how many difficult-to-notice-at-development-time errors seem plausible. This is the approach I would use to estimate risk if I were building a bridge, skyscraper, moon-lander or cargo ship (if there weren’t better estimates available from past experience).
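One crude way to turn those gut-level judgements into a number (my own back-of-envelope framing, treating the flaws as independent, which they won’t really be): if the exercise surfaces $N$ candidate flaws and you judge flaw $i$ to have probability $p_i$ of being present and surviving every test you can run, then the chance that at least one slips through is roughly

$$P(\text{some flaw survives development}) \;\approx\; 1 - \prod_{i=1}^{N}(1 - p_i),$$

which climbs uncomfortably close to 1 even for modest $p_i$ once $N$ is in the dozens: thirty flaws at $p_i = 0.05$ each already gives about a 79% chance.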
It’s epistemically tricky to estimate this without bias, and the trickiness is amplified by the holes in my mechanistic understanding of future AGI internals. The task of finding lots of problems like this could lead to overestimating the risk, by directing creativity at finding lots of extremely implausible flaws; this is one way this kind of reasoning could end up being wrong. On the other hand, there’s a bias in the opposite direction, caused by anchoring on the intended mechanisms and relying on creativity to come up with unintended mechanisms that nonetheless pass all training and tests. Still, I think it’s possible to calibrate your reasoning here by comparison to the ship analogy (and similar engineering tasks with big distribution shifts) and by calibrating against experience with ML generalisation.
I’ve worked through this exercise a few times, for different ways of thinking about AGI internal mechanisms, and this provides very rough estimates of the number of “degrees of freedom” in the design of an AGI conditioned on it passing all tests we are capable of performing. The conclusion I’ve drawn from doing this is that this iterative path to AGI development is near guaranteed to fail, in the sense that the intended corrigibility property won’t survive the distribution shifts.
Slow, online adversarial training
One counterargument to the story I’m telling here is to propose slow, online adversarial[14] training. In this approach we try to go through every distribution shift in slow motion, in the hope that we can detect and patch problems as we go. We keep doing this all the way through the feared distribution shifts, so the AGI is actively helping with research and doing its own research. The hope here is that it helps with the “observe problems” side of the empirical feedback loop.
Here’s why I don’t expect it to help very much:
Chain of thought gives us the transparency we need to iterate
Chain of thought ideally allows us to analyze the process that caused unexpected behavior, approximately diagnose what’s going on, and theorise about what fixes might solve the underlying problem. It’s hoped that this helps us with point one in the previous section.
I think this isn’t as good as it looks, and in particular will be worse in the future,[15] for two reasons:
Putting these together
If we’re carefully monitoring the CoT of an AGI as it passes through the distribution shifts,[16] then we might see some thoughts that indicate the agent isn’t corrigible or aligned. For example, it might notice habits in itself that it deems worth removing, then reconsider (in CoT) the priorities of different tasks. This kind of evidence of incorrigibility will usually be somewhat ambiguous, but worrying to some people.
My mainline scenario is that the response to this will be some kind of retraining or patching, making a new AGI that shows dramatically less of whatever evidence was previously worrying. Perhaps this repeats a couple of times, but soon enough the ambiguous evidence is gone and the AGI is somewhat more trusted. It goes on to be used for more research, and for this purpose is given more freedom to experiment, explore, learn and self-improve. The exact point at which it stops playing along is hard to predict, so my distribution is very spread out.
There’s one good future that I see coming from the iterative “work with AIs to improve AIs” approach. At some point in the iteration process, the engineers (or AI agents) will realise their ability to spot and fix problems isn’t nearly good enough, and they’ll push for a more legible architecture before going further.
What kind of legible architecture would be enough to give me optimism? The most bare-bones would be interpretability into the beliefs and desires of an AI, and the structural knowledge to verify that those beliefs and desires are the true beliefs and desires of the AI. It’d be good to be able to distinguish behavioral and strategic heuristics from beliefs about the world, and understand when and why heuristics will be replaced as an agent learns. If the agent is best understood as Bayesian, I want to be able to tell what prior it’s using.
From this starting point, alignment and corrigibility work would be tractable but hard. We’d need to work out what evidential threshold the AI would use before replacing a part of its own algorithm with an “improvement”. We’d need to work out how beliefs and values drift as online learning updates are made. We’d need to work out whether there are adversarial examples that exploit the desires, or that exploit the belief updating procedure. We’d need to become reasonably confident that the prior is “reasonable” and doesn’t lead to weird beliefs, or that there are failsafe mechanisms if it is not. We’d want to somehow confirm that lots of thinking won’t lead to our failsafes breaking or being removed. This work would be tractable because we would have a far greater ability to draw evidence from small experiments (with components of an agent) to the implications about a full general intelligence.
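To give one toy example of what I mean by working out an “evidential threshold” (this is my own illustrative formalisation, not a claim about how such an agent would actually be built): for a roughly Bayesian agent we could ask for the log Bayes factor $\tau$ it requires before swapping a component $c$ of its own algorithm for a candidate replacement $c'$, given data $D$:

$$\text{replace } c \text{ with } c' \quad \text{iff} \quad \log \frac{P(D \mid c')}{P(D \mid c)} \;>\; \tau.$$

The tractable-but-hard work would then include checking how the corrigibility-relevant properties behave as a function of $\tau$, of the class the candidate replacements are drawn from, and of whether the failsafes themselves count as replaceable components.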
I hope this post has conveyed the main problems I see with iterative approaches to corrigible AGI development, and why the basin of attraction analogy is a misleading way to think about this process. I want to stress that reading arguments like the ones in this post isn’t sufficient to understand which perspective on corrigibility is correct. You have to work through the reasoning using your own examples, and do the exercises using your own most mechanistically detailed models.
There are some controversial beliefs I have that are mainly downstream of the arguments in this post, but also somewhat downstream of other beliefs and arguments that aren’t explained in this post. I’ve briefly stated them in the following dropdown:
Things I believe, mostly as a result of the arguments in this post
Many thanks to Steve Byrnes, Max Harms and Seth Herd for extremely helpful feedback.
I’m probably abusing these definitions a bit, apologies to philosophers. ↩︎
The overseer, or developer. I’m following Max’s terminology. ↩︎
“While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of “robust” will recover much of what I think is good about the flawed-tool frame.” Source. ↩︎
Disclaimer: I don’t know anything about shipbuilding, although I once technically did win an award from the Royal Institute of Naval Architects for my part in building a rowboat. ↩︎
In the shipbuilding analogy, I would come up with things like storms causing unusually high stress on rusted bolts, because it’s the sort of thing that’s difficult to notice in development tests. ↩︎
Or the behavioral appearance of different goals. ↩︎
Most people, in my experience. ↩︎
I hope it’s not just me that does this. ↩︎
Something like this seems to be true of me. ↩︎
Although beware of shell games. It can be easy with some of these models of intelligence to accidentally hide the important capability generators in a black box, after which it becomes difficult to imagine ways that the black box might contain poorly designed mechanisms. ↩︎
I’ll discuss this possibility in a later section. ↩︎
Using an approach more like MIRI corrigibility or ELK. ↩︎
The failures that aren’t likely to be caught by patching all visible problems that are detectable during development. ↩︎
In the sense that researchers are actively trying to put the AI in test situations that elicit unintended behavior and train it not to generate that behavior, in parallel to using it to do research and help redesign itself. ↩︎
We’ll probably lose legible chain of thought for various capability-related reasons, but I’ll set that aside. ↩︎
i.e. does self-improvement research and successfully uses the results of this research. ↩︎