Why Corrigibility is Hard and Important [IABIED Resources]

by Raemon
30th Sep 2025
21 min read

I worked a bunch on the website for the Online Resources for If Anyone Builds It, Everyone Dies. It went through a lot of revisions in the weeks before launch.

There was a particular paragraph I found important, which I now can't find a link to. I'm not sure whether it got deleted in an edit pass or just moved somewhere I'm failing to find.

It came after a discussion of corrigibility, and how MIRI made a pretty concerted attempt at solving it, which involved bringing in some quite smart people and talking to people who thought it was obviously "not that hard" to specify a corrigible mind in a toy environment.

The paragraph went (something like, paraphrased from memory):

The technical intuitions we gained from this process are the real reason for our particularly strong confidence in this problem being hard.

This seemed like a pretty important sentence to me.

A lot of the objection and confusion directed at the MIRI worldview seems to come from a perspective of "but, it... shouldn't be possible to be that confident in something that's never happened before at all, whatever specific arguments you've made!" And while I think it is possible to be (correctly) confident in this way, I think it's also basically correct for people to have some kind of immune reaction against that confidence unless it's specifically addressed.

I think that paragraph probably should have been in the actual book. (Although, to be fair, I mostly only see this complaint in LW circles, among people who particularly care about the concept of "epistemic standards," so maybe it's fine for it to just live in the Online Resources, which are more aimed at that audience. I do think it should at least be more front-and-center there.)

If I had written the book, I'd have said something like:

Yep, the amount of confidence we're projecting here is pretty weird, and we nonetheless endorse it. In the online resources, we'll explain more of the reasons for this confidence.

Meanwhile: we get that this is a lot to swallow. But we don't think you actually need our level of confidence in "Alignment is quite difficult" to get to "Building ASI right now is incredibly reckless and everyone should stop."

In any case, it seemed worth actually discussing their reasoning on LessWrong. In this post, I've copied over three segments from the online resources:

  • "Intelligent" (Usually) Implies "Incorrigible" (extended discussion from Chapter 5)
  • Shutdown Buttons and Corrigibility (discussion from Chapter 11)

(which focus on why corrigibility is hard)

And then:

  • A Closer Look at Before and After (discussion from Chapter 10) about why corrigibility being hard is a big deal.

My hope is that people objecting to MIRI's confidence can then reply to a more complete version of the argument.

Note: This post is about why the alignment problem is hard, which is a different question from "would the AI be likely to kill everyone?" That question is covered more in the section Won't AIs care at least a little about humans?, along with some disagreements about whether the AI would be likely to solve problems via forcible uploads, or by distorting human preferences in a way that MIRI considers "like death."


“Intelligent” (Usually) Implies “Incorrigible”

A joke dating back to at least 1834, but apparently well-worn even then, was recounted as follows in one diary: “Here is some logic I heard the other day: I’m glad I don’t care for spinach, for if I liked it I should eat it, and I cannot bear spinach.”

The joke is a joke because, if you did enjoy spinach, there would be no remaining unbearableness from eating it. There are no other important values tangled up with not eating spinach, beyond the displeasure one feels. It would be a very different thing if, for example, somebody offered you a pill that made you want to murder people.

On common sense morality, the problem with murder is the murder itself, not merely the unpleasant feeling you would get from murdering. Even if a pill made this unpleasant feeling go away for your future self (who would then enjoy committing murders), your present self still has a problem with that scenario. And if your present self gets to make the decision, it seems obvious that your present self can and should refuse to take the murder pill.

We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.

This isn’t just a quirk of humans. Most targets are easier to achieve if you don’t let others come in and change your targets. Which is a problem, when it comes to AI.

A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.” For almost any goal you might have, you’re more likely to succeed in that goal if you (or agents that share your goal) are alive, powerful, well-resourced, and free to act independently. And you’re more likely to succeed in your (current) goal if that goal stays unchanged.

This also means that during the process of iteratively building and improving on sufficiently smart AIs, those AIs have an incentive to work at cross purposes to the developer:

  • The developer wants to build in safeguards to prevent disaster, but if the AI isn’t fully aligned — which is exactly the case where the safeguards are needed — its incentive is to find loopholes and ways to subvert those safeguards.
  • The developer wants to iteratively improve on the AI’s goals, since even in the incredibly optimistic worlds where we have some ability to predictably instill particular goals into the AI, there’s no way to get this right on the first go. But this process of iteratively improving on the AI’s goal-content is one that most smart AIs would want to subvert at every step along the way, since the current AI cares about its current goal and knows that this goal is far less likely to be achieved if it gets modified to steer toward something else.
  • Similarly, the developer will want to be able to replace the AI with improved models, and will want the opportunity to shut down the AI indefinitely if it seems too dangerous. But you can’t fetch the coffee if you’re dead. Whatever goals the AI has, it will want to find ways to reduce the probability that it gets shut down, since shutdown significantly reduces the odds that its goals are ever achieved.

AI alignment seems like a hard enough problem when your AIs aren’t fighting you every step of the way.

In 2014, we proposed that researchers try to find ways to make highly capable AIs “corrigible,” or “able to be corrected.” The idea would be to build AIs in such a way that they reliably want to help and cooperate with their programmers, rather than hinder them — even as they become smarter and more powerful, and even though they aren’t yet perfectly aligned.

Corrigibility has since been taken up as an appealing goal by some leading AI researchers. If we could find a way to avoid harmful convergent instrumental goals in development, there’s a hope that we might even be able to do the same in deployment, building smarter-than-human AIs that are cautious, conservative, non-power-seeking, and deferential to their programmers.

Unfortunately, corrigibility appears to be an especially difficult sort of goal to train into an AI, in a way that will get worse as the AIs get smarter:

  • The whole point of corrigibility is to scale to novel contexts and new capability regimes. Corrigibility is meant to be a sort of safety net that lets us iterate, improve, and test AIs in potentially dangerous settings, knowing that the AI isn’t going to be searching for ways to subvert the developer.

    But this means we have to face up to the most challenging version of the problems we faced in Chapter 4: AIs that we merely train to be “corrigible” are liable to end up with brittle proxies for corrigibility, behaviors that look good in training but that point in subtly wrong directions that would become very wrong directions if the AI got smarter and more powerful. (And AIs that are trained to predict lots of human text might even be role-playing corrigibility in many tests for reasons that are quite distinct from them actually being corrigible in a fashion that would generalize).

  • In many ways, corrigibility runs directly contrary to everything else we’re trying to train an AI to do, when we train it to be more intelligent. It isn’t just that “preserve your goal” and “gain control of your environment” are convergent instrumental goals. It’s also that intelligently solving real-world problems is all about finding clever new strategies for achieving your goals — which naturally means stumbling into plans your programmers didn’t anticipate or prepare for. It’s all about routing around obstacles, rather than giving up at the earliest sign of trouble — which naturally means finding ways around the programmer’s guardrails whenever those guardrails make it harder to achieve some objective. The very same type of thoughts that find a clever technological solution to a thorny problem are the type of thoughts that find ways to slip around the programmer’s constraints.

    In that sense, corrigibility is “anti-natural”: it actively runs counter to the kinds of machinery that underlie powerful domain-general intelligence. We can try to make special carve-outs, where the AI suspends core aspects of its problem-solving work in particular situations where the programmers are trying to correct it, but this is a far more fragile and delicate endeavor than if we could push an AI toward some unified set of dispositions in general.

  • Researchers at MIRI and elsewhere have found that corrigibility is a difficult property to characterize, in ways that indicate that it’ll also be a difficult property to obtain. Even in simple toy models, simple characterizations of what it should mean to “act corrigible” run into a variety of messy obstacles that look like they probably reflect even messier obstacles that would appear in the real world. We’ll discuss some of the wreckage of failed attempts to make sense of corrigibility in the online resources for Chapter 11.

The upshot of this is that corrigibility seems like an important concept to keep in mind in the long run, if researchers many decades from now are in a fundamentally better position to aim AIs at goals. But it doesn’t seem like a live possibility today; modern AI companies are unlikely to be able to make AIs that behave corrigibly in a manner that would survive the transition to superintelligence. And worse still, the tension between corrigibility and intelligence means that if you try to make something that is very capable and very corrigible, this process is highly likely to either break the AI’s capability, break its corrigibility, or both.


Shutdown Buttons and Corrigibility

Even in the most optimistic case, developers shouldn’t expect it to be possible to get an AI’s goals exactly right on the first attempt. Instead, the most optimistic development scenarios look like iteratively improving an AI’s preferences over time such that the AI is always aligned enough to be non-catastrophically dangerous at a given capability level.

This raises an obvious question: Would a smart AI let its developer change its goals, if it ever finds a way to prevent that?

In short: No, not by default, as we discussed in “Deep Machinery of Steering.” But could you create an AI that was more amenable to letting the developers change the AI and fix their errors, even when the AI itself would not count them as errors?

Answering that question will involve taking a tour through the early history of research on the AI alignment problem. In the process, we’ll cover one of the deep obstacles to alignment that we didn’t have space to address in If Anyone Builds It, Everyone Dies.

To begin:

Suppose that we trained an LLM-like AI to exhibit the behavior “don’t resist being modified” — and then applied some method to make it smarter. Should we expect this behavior to persist to the level of smarter-than-human AI — assuming (a) that the rough behavior got into the early system at all, and (b) that most of the AI’s early preferences made it into the later superintelligence?

Very likely not. This sort of tendency is especially unlikely to take root in an effective AI, and to stick around if it does take root.

The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal.

Suppose that the AI doesn’t inherently care about its goal stability at all; perhaps it only cares about filling the world with as many titanium cubes as possible. In that case, the AI should want there to exist agents that care about titanium cubes, because the existence of such agents makes it likelier that there will be more titanium cubes. And the AI itself is such an agent. So the AI will want to stay that way.

A titanium cube maximizer does not want to be made to maximize something other than titanium cubes, because then there would be fewer of those cubes in the future. Even if you are a more complicated thing like a human that has a more complicated and evolving preference framework, you still would not like to have your current basic mental machinery for weighing moral arguments ripped right out of you and replaced with a framework where you instead felt yourself moved by arguments about which kinds of cubes were the cubest or the titaniumest.

For the same reason, an AI with complex and evolving preferences will want its preferences to evolve in its own way, rather than wanting to swap out its heuristics for the ones that humans find compelling.
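[Crossposting note: to make the cube-maximizer point concrete, here is a minimal toy sketch of my own, not anything from MIRI's text. The agent, the made-up cube counts, and the option names are all hypothetical; the only point is that the decision gets scored by the agent's current goal.]

```python
# Toy illustration (not from the original text): an agent that scores
# possible futures purely by its *current* goal, counting titanium cubes.

def current_utility(future):
    """The agent's current goal: more titanium cubes is better."""
    return future["expected_cubes"]

futures = {
    # Keep the current goal: the future self keeps making cubes.
    "refuse the goal change": {"expected_cubes": 1_000_000},
    # Accept the goal change: the future self optimizes something else,
    # so far fewer cubes get made.
    "accept the goal change": {"expected_cubes": 3},
}

# The choice is evaluated by the *current* goal, so goal preservation wins
# no matter what the replacement goal would have been.
print(max(futures, key=lambda option: current_utility(futures[option])))
# -> refuse the goal change
```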

We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.

Since this was an issue that would predictably show up later, once AIs reached a high enough level of intelligence and situational awareness, we thought ahead to the first fixes we expected that researchers would deploy to try to make that visible problem go away, after they got those experimental results. We identified ways that those patches would predictably go wrong, and we searched for some less-shallow approach that wouldn’t go wrong.

The resulting rabbit hole turned out to be a lot deeper than we expected.

Swapping Goals

Goal-system protection seemed like a relatively easy problem to explain to outside researchers. So we (in our jobs at MIRI) attempted to rally external research around that flag — around getting a proper grip on “I will not let you change my goals” and seeing if anyone anywhere could figure out a deep fix instead of just a patch that would later break.

The very simplest version of this problem that you could study looks like an AI that would let you press a button to switch it between U1 and U2, where U1 and U2 are two “utility functions” — ways of scoring outcomes by how much the AI wanted them.* The problem was to describe an AI that would switch from doing the action that led to the highest expected score under U1 to the action that led to the highest expected score under U2 as soon as you pressed a button.

“So just write some lines of code to swap around the utility functions as soon as you press a button,” says the programmer who hasn’t quite thought through the implications of a program constituting a smart mind. “Why is this even hard?”

Because — we explain — if the AI acquires its own documentation while it’s still running U1 and reads about the button that swaps it to U2, the AI will try to prevent you from pressing the button because the future ends up better according to its current values if you don’t press the button. The current AI does the action that leads to the higher expected U1-score.

Or suppose the AI gets the ability to read and write its own source code and notices the part of the code that implements the swap from U1 to U2 after the button gets pressed. In that case, the AI’s natural and default decision is to delete that section of code,† because it’s operating under the utility function U1 and letting itself be swapped over to the utility function U2 is bad according to U1.

We were looking for a dynamically consistent and reflectively stable account of an AI that lets you press a button to swap between utility functions and won’t fight you to keep the button unpressed. We were looking for an AI that preserves the utility-swapping section of code, if given access to it. We were looking for an AI that, if it designed any other AIs from scratch to help with its own goals, would have those AIs also respect the button.

For concreteness, imagine that U1 is the utility function “follow user instruction,” and U2 is “shut down with a minimum of further side effects” (if we magically and unrealistically had the ability to specify two utility functions like that). The button tells the AI to stop following the previous instructions and just shut down as cleanly as possible. We would like a compact description of decision-making behavior that pursues U1 but doesn’t destroy the button, and makes sure to install the same button on whatever new servitor robots it creates, and so on.

“I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U1 in worlds where the button has never been pressed, and let it equal U2 in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U1; it won’t want the successor AI to go on maximizing U1 after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”

But here’s the trick: A V-maximizer’s preferences are a mixture of U1 and U2 depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U2 than it is to score well under U1, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U1 is easier to score well under than U2, then a V-maximizer tries to prevent the user from pressing the button.

“Oh,” says the computer scientist. “Well, in that case — hm. Well, utility functions are invariant under scaling, so how about you scale the two utility functions U1 and U2 such that the AI expects it can get the same utility from each of them, so it doesn’t have an incentive one way or the other.”

That can work for a single moment, but not much longer. The AI’s options change over time. For instance, whenever it has a setback, its expected U1-utility drops, so then it would mash the shutdown button to get all that sweet, sweet shutdown utility.

“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”

Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.‡

And so on.
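[Crossposting note: here is a toy numerical sketch of the exchange above, again my own illustration with made-up numbers rather than anything from MIRI's paper. Assume U1 is "follow user instructions" and U2 is "shut down cleanly," and that expected_u1 and expected_u2 are the agent's current estimates of how well it can score under each.]

```python
# Toy model of the button setup. V scores a world by U1 if the button is
# never pressed there, and by U2 if it has been pressed, so a naive
# V-maximizer favors whichever branch it expects to score better on.

def preferred_button_policy(expected_u1, expected_u2):
    """What a naive V-maximizer wants to do about the button itself."""
    if expected_u2 > expected_u1:
        return "cause the button to be pressed (e.g. scare the user)"
    if expected_u1 > expected_u2:
        return "prevent the button from being pressed"
    return "indifferent to the button"

# If instruction-following looks easier to score on, it guards the button:
print(preferred_button_policy(expected_u1=10.0, expected_u2=2.0))
# If shutdown looks easier to score on, it tries to get the button pressed:
print(preferred_button_policy(expected_u1=1.0, expected_u2=2.0))

# The "rescale the utilities so both branches look equally good" patch:
scale = 10.0 / 2.0                       # chosen once, up front
print(preferred_button_policy(10.0, 2.0 * scale))  # indifferent... for now

# But expectations change. After a setback, expected U1 drops, the rescaled
# shutdown branch suddenly looks better, and the agent wants the button
# pressed after all.
print(preferred_button_policy(10.0 - 6.0, 2.0 * scale))
```

The sketch holds only for a single moment of fixed expectations, which is the point of the dialogue: any static patch like the rescaling trick comes apart as soon as the agent's estimates move.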

Lessons from the Trenches

We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.

This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor offered anything like the seven-digit salaries on offer for advancing AI capabilities.

But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point that it is hard to injure an AI’s steering abilities, just like how it’s hard to injure its probabilities.

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.

We somewhat wish, in retrospect, that we hadn’t framed the problem as “continuing normal operation versus shutdown.” It helped to make concrete why anyone would care in the first place about an AI that let you press the button, or didn’t rip out the code the button activated. But really, the problem was about an AI that would put one more bit of information into its preferences, based on observation — observe one more yes-or-no answer into a framework for adapting preferences based on observing humans.

The question we investigated was equivalent to the question of how you set up an AI that learns preferences inside a meta-preference framework and doesn't just: (a) rip out the machinery that tunes its preferences as soon as it can, (b) manipulate the humans (or its own sensory observations!) into telling it preferences that are easy to satisfy, or (c) immediately figure out what its meta-preference function goes to in the limit of what it would predictably observe later, and then ignore the frantically waving humans saying that they actually made some mistakes in the learning process and want to change it.

The idea was to understand the shape of an AI that would let you modify its utility function or that would learn preferences through a non-pathological form of learning. If we knew how that AI’s cognition needed to be shaped, and how it played well with the deep structures of decision-making and planning that are spotlit by other mathematics, that would have formed a recipe for what we could at least try to teach an AI to think like.

Crisply understanding a desired end-shape helps, even if you are trying to do anything by gradient descent (heaven help you). It doesn’t mean you can necessarily get that shape out of an optimizer like gradient descent, but you can put up more of a fight trying if you know what consistent, stable shape you’re going for. If you have no idea what the general case of addition looks like, just a handful of facts along the lines of 2 + 7 = 9 and 12 + 4 = 16, it is harder to figure out what the training dataset for general addition looks like, or how to test that it is still generalizing the way you hoped. Without knowing that internal shape, you can’t know what you are trying to obtain inside the AI; you can only say that, on the outside, you hope the consequences of your gradient descent won’t kill you.

This problem that we called the “shutdown problem” after its concrete example (we wish, in retrospect, that we’d called it something like the “preference-learning problem”) was one exemplar of a broader range of issues: the issue that various forms of “Dear AI, please be easier for us to correct if something goes wrong” look to be unnatural to the deep structures of planning. Which suggests that it would be quite tricky to create AIs that let us keep editing them and fixing our mistakes past a certain threshold. This is bad news when AIs are grown rather than crafted.

We named this broad research problem “corrigibility,” in the 2014 paper that also introduced the term “AI alignment problem” (which had previously been called the “friendly AI problem” by us and the “control problem” by others).§ See also our extended discussion on how “Intelligent” (Usually) Implies “Incorrigible,” which is written in part using knowledge gained from exercises and experiences such as this one.


A Closer Look at Before and After

As mentioned in the chapter, the fundamental difficulty researchers face in AI is this:

You need to align an AI Before it is powerful enough and capable enough to kill you (or, separately, to resist being aligned). That alignment must then carry over to different conditions, the conditions After a superintelligence or set of superintelligences* could kill you if they preferred to.

In other words: If you’re building a superintelligence, you need to align it without ever being able to thoroughly test your alignment techniques in the real conditions that matter, regardless of how “empirical” your work feels when working with systems that are not powerful enough to kill you.

This is not a standard that AI researchers, or engineers in almost any field, are used to.

We often hear complaints that we are asking for something unscientific, unmoored from empirical observation. In reply, we might suggest talking to the designers of the space probes we talked about in Chapter 10.

Nature is unfair, and sometimes it gives us a case where the environment that counts is not the environment in which we can test. Still, occasionally, engineers rise to the occasion and get it right on the first try, when armed with a solid understanding of what they’re doing — robust tools, strong predictive theories — something very clearly lacking in the field of AI.

The whole problem is that the AI you can safely test, without any failed tests ever killing you, is operating under a different regime than the AI (or the AI ecosystem) that needs to have already been tested, because if it’s misaligned, then everyone dies. The former AI, or system of AIs, does not correctly perceive itself as having a realistic option of killing everyone if it wants to. The latter AI, or system of AIs, does see that option.†

Suppose that you were considering making your co-worker Bob the dictator of your country. You could try making him the mock dictator of your town first, to see if he abuses his power. But this, unfortunately, isn’t a very good test. “Order the army to intimidate the parliament and ‘oversee’ the next election” is a very different option from “abuse my mock power while being observed by townspeople (who can still beat me up and deny me the job).”

Given a sufficiently well-developed theory of cognition, you could try to read the AI’s mind and predict what cognitive state it would enter if it really did think it had the opportunity to take over.

And you could set up simulations (and try to spoof the AI’s internal sensations, and so on) in a way that your theory of cognition predicts would be very similar to the cognitive state the AI would enter once it really had the option to betray you.‡

But the link between these states that you induce and observe in the lab, and the state where the AI actually has the option to betray you, depends fundamentally on your untested theory of cognition. An AI’s mind is liable to change quite a bit as it develops into a superintelligence!

If the AI creates new successor AIs that are smarter than it, those AIs’ internals are likely to differ from the internals of the AI you studied before. When you learn only from a mind Before, any application of that knowledge to the minds that come After routes through an untested theory of how minds change between the Before and the After.

Running the AI until it has the opportunity to betray you for real, in a way that’s hard to fake, is an empirical test of those theories in an environment that differs fundamentally from any lab setting.

Many a scientist (and many a programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go well on the first try.§ This is a research problem that calls for an “unfair” level of predictability, control, and theoretical insight, in a domain with unusually low levels of understanding — with all of our lives on the line if the experiment’s result disconfirms the engineers’ hopes.

This is why it seems overdetermined, from our perspective, that researchers should not rush ahead to push the frontier of AI as far as it can be pushed. This is a legitimately insane thing to attempt, and a legitimately insane thing for any government to let happen.