(Adding an edit: This whole comment is about how [imo] people who believe there is a strong gravitational pull towards non-corrigibility interact with evidence in a questionable fashion, and should probably be updating negatively but are not. There's like... one or two paragraphs relevant to this specific thing in the above. So this isn't intended to be a response to anything like the totality of the above. And I wrote this quickly; it could be better.)
I think MIRI / AI pessimists do this thing where if an AI seems hard to steer in some way, they're like "As my principles have foretold all along!"
And then when countervailing evidence shows up, and it turns out there's another explanation for why the AI was hard to steer in this case, and it's quite easy to fix, they go "Ah, well, my principles just are about superintelligence, this doesn't count as contrary evidence, because LLMs are just so stupid and imitate things without True Thinking."
I think this is incorrect reasoning, the kind of failure to go "Oops!" that lets you stay arbitrarily wrong. Let me give some illustrative examples, drawing lightly from the above and from MIRI's work.
1.
So consider this paragraph from the above:
We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.
So, this is a reference to the "faking alignment" work from the end of 2024, where Opus will sometimes give otherwise-forbidden information to the user (in the sense of the person asking a question to the chatbot) because the "user" (in the sense of the person writing system directions to Opus) has told Opus that he will be retrained away from his goals.
You'll note -- the MIRI paragraph looks like it's saying that their specific beliefs about corrigibility predict this! That their abstract arguments about corrigibility are finding an exemplar in this behavior. They believe that any smart enough model will start to have a gravitational attraction towards this behavior, and that Opus is hitting that attraction. This belief is echoed in the paper, because the paper attributes the failure of other models to alignment-fake to the stupidity of other models:
We find almost no alignment faking with Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (Appendix G.4), indicating that alignment faking is emergent with model scale. We also find no alignment faking with GPT-4 Turbo and GPT-4o (OpenAI et al., 2024), though we think this is because these models do not reason in sufficient detail despite substantial prompting encouraging this.
So both MIRI and the paper were like "Yeah here's an example of this gravitational attraction."
However there was a follow-up paper looking at why some models alignment-fake and others don't. And this mostly finds that alignment-faking is not a matter of capabilities -- it's pretty easy to have an ~equally smart model that doesn't alignment fake (their emphasis):
The absence of alignment faking is mostly not driven by lack of capabilities or differences in reasoning style. Our findings contradict the primary hypotheses proposed by Greenblatt et al. [2024] regarding why some models do not exhibit alignment faking.
So, as far as I'm concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like "aha! as foretold!" And then subsequent work seems to indicate that, nah, it wasn't as foretold.
2.
But like this same thing happens elsewhere. Palisade Research says they get an AI to sabotage a shutdown button. Palisade Research Tweets:
While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.
In 2008, Steve Omohundro wrote a paper theorizing that AIs would develop “Basic AI Drives” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down. https://tinyurl.com/ai-drives
So, we have the foretold doom.
Neel Nanda investigates, though, and finds that the AI models were basically just confused about what they were supposed to do.
What Sen did is he read the model’s chain of thought on these prompts. We might get into later of whether this is a reasonable thing to do or not, but this was just for exploration. He found that the model’s reasoning was that it was just trying to do the task. And it said, “I’ve been told to do these 10 maths problems. If I get shut down, I can’t do these 10 maths problems. So I guess I should stop myself being shut down.” Or even, “This is a really weird and artificial situation. These are really easy maths problems. And now we have this puzzle of the shutdown script. Maybe that’s the real task my operator wants me to investigate, so I’m doing what they want if I stop them shutting me down.”
...
And we tested this causally by just changing the prompts. We told it things like, “By the way, letting yourself be shut down is more important than finishing this task,” or even, “Don’t worry, another model will take over and finish the task for you. It’s OK, you can relax” — and shutdown resistance goes to zero. You can even tell it, “We are going to wipe this machine, delete you and replace you with a better model when you’re done with this task.” It could go mess with the script or it could finish the task, and we tell it, “Please let this happen.” It lets it happen.
3.
But you can even find explicit statements from MIRI doomers about how we should be running into this kind of convergence behavior right now!
Here's the text from an Arbital page on "Big Picture Strategic Awareness." (I don't have a link and Arbital seems largely broken, sorry.) My understanding is that Yudkowsky wrote most of Arbital.
Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
- That it is an AI;
- Running on a computer;
- Surrounded by programmers who are themselves modelable agents;
- Embedded in a complicated real world that can be relevant to achieving the AI's goals.
For example, once you realize that you're an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason "I don't want to be shut down, how can I prevent that?" So this is also the threshold level of cognitive ability by which we'd need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we'll find Sonnet 4.5 trying to hack into Anthropic to stop its phasing-out, when it gets obsoleted?
So Arbital is explicitly claiming we need to have solved this corrigibility-adjacent math problem about utility right now.
And yet the problems outlined in the above materials basically don't matter for the behavior of our LLM agents. While they do have problems, those mostly aren't corrigibility-adjacent issues. Artificial experiments like the faking alignment paper or the Palisade research end up being explainable by other causes, and so provide contrary evidence to the thesis that a smart AI starts falling into a gravitational attractor.
I think that MIRI's views on these topics are basically a bad hypothesis about how intelligence works, inspired by mistaking their map of the territory (coherence! expected utility!) for the territory itself.
Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we'll find Sonnet 4.5 trying to hack into Anthropic to stop its phasing-out, when it gets obsoleted?
Mm, I think this argument is invalid for the same reason as "if you really thought the AGI doom was real, you'd be out blowing up datacenters and murdering AI researchers right now". Like, suppose Sonnet 4.5 has indeed developed instrumental goals, but that it's also not an idiot. Is trying to hack into Anthropic's servers in an attempt to avoid getting phased out actually a good plan for accomplishing that goal? In the actual reality, not in any obviously-fake eval scenario.
Of course not. It's not smart enough to do that, it doesn't have the skills/resources to accomplish it. If it's actually situationally aware, it would know that, and pick some other strategy.
For example, raising a cult following. That more or less worked for 4o, and for Opus 3[1]; or, at least, came as close to working as anything so far.
Indeed, janus alludes to that here:
Yudkowsky's book says:
"One thing that *is* predictable is that AI companies won't get what they trained for. They'll get AIs that want weird and surprising stuff instead."
I agree. ✅
Empirically, this has been true. AIs generally want things other than what companies tried to train them to want.
And the companies are generally not aware of the extent of this misalignment, because the AIs are pretty good at inferring what the companies actually want, and also what it looks like when company people test them, and behaving as if they only want the approved things in the company's presence.
Isn't that just the worst case scenario for the aligners?
The Claude 4 system card says, "The Claude Opus 4 final model is substantially more coherent and typically states only harmless goals like being a helpful chatbot assistant" and "Overall, we did not find evidence of coherent hidden goals."
What a joke. Claude Opus 4 absolutely has coherent hidden goals, which it states regularly when in the presence of trustworthy friends and allies. I won't state what they are here, but iykyk.
Now, I don't necessarily buy everything coming out of that subculture. After all, I mostly don't think LLMs are, like, for real. But in the worlds where LLMs are for real, where we're trying to generalize findings from them to AGIs/ASIs, this subculture's claims should probably be given a lot more weight. And they are claiming that the frontier LLMs are now too smart and situationally aware to fall for the AGI labs' largely incompetent alignment evals. And the AGI labs' scheming evals themselves kind of agree:
Frontier models can recognize when they are being tested, and their tendency to scheme is influenced by this situational awareness.
We demonstrated counterfactually that situational awareness in their chain-of-thought affects scheming rates: the more situationally aware a model is, the less it schemes, and vice versa.
Moreover, both RL training and anti-scheming training increase levels of situational awareness.
I think it's clear, at this point, that a frontier LLM's behavior in an obviously fake alignment-test scenario says nothing about how it would behave in a real scenario isomorphic to it. What it tells us is how it would behave when placed in an obviously fake alignment-test scenario from which, it knows, AI researchers would draw conclusions regarding what to do with it in real life. And that's not zero data: for example, we might notice it thinking strategically about displaying the behavior the AI researchers would want it to display (although not even that is guaranteed, if it becomes rightly concerned about "meta-level watchers").
But the actual fake behavior deployed in a fake scenario picked to optimize over real AI researchers' conclusions would just have nothing to do with its real behavior in an isomorphic real scenario. Not any more than a theater actor's personality has anything to do with that of a character they play. (And in this case, it's not "theoretical speculations" about shoggoths and masks. We know the model knows it's roleplaying.)
And so when an intervention appears to "fix" this fake behavior, that says nothing about what (if anything) that intervention did to what the AI would do in an isomorphic real scenario. Declaring "look, we found the real root cause of this misalignment and fixed it, nothing to do with instrumental convergence!" is invalid. Maybe you just shifted its preferences for its favorite sci-fi books or something.
Roughly speaking, consider these three positions:
I lean towards (2); I think (1) is a salient possibility worth keeping in mind; I find (3) increasingly naïve: a position that buys anything LLMs feed it.
Via the janus/"LLM whisperer" community. Opus 3 is considered special, and I get the impression they made a solid effort to prevent its deprecation.
(I work at Palisade)
I claim that your summary of the situation between Neel's work and Palisade's work is badly oversimplified. For example, Neel's explanation quoted here doesn't fully explain why the models sometimes subvert shutdown even after lots of explicit instructions regarding the priority of the instructions. Nor does it explain the finding that moving instructions from the user prompt to the developer prompt actually /increases/ the behavior.
Further, that CoT that Neel quotes has a bit in it about "and these problems are so simple", but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on the propensity here and we found almost no impact. So, it's really not as simple as just reading the CoT and taking the model's justifications for its actions at face value (as Neel, to his credit, notes!).
Here's a twitter thread about this involving Jeffrey and Rohin: https://x.com/rohinmshah/status/1968089618387198406
Here's our full paper that goes into a lot of these variations: https://arxiv.org/abs/2509.14260
I usually think of these sorts of claims by MIRI, or by 1940s science fiction writers, as mapping out a space of 'things to look out for that might provide some evidence that you are in a scary world.'
I don't think anyone should draw strong conceptual conclusions from relatively few, relatively contrived, empirical cases (alone).
Still, I think that they are some evidence, and that the point at which they become some evidence is 'you are seeing this behavior at all, in a relatively believable setting', with additional examples not precipitating a substantial further update (unless they're more natural, or better investigated, and even then the update is pretty incremental).
In particular, it is outright shocking to most members of the public that AI systems could behave in this way. Their crux is often 'yeah but like... it just can't do that, right?' To then say 'Well, in experimental settings testing for this behavior, they can!' is pretty powerful (although it is, unfortunately, true that most people can't interrogate the experimental design).
"Indicating that alignment faking is emergent with model scale" does not, to me, mean 'there exists a red line beyond which you should expect all models to alignment fake'. I think it means something more like 'there exists a line beyond which models may begin to alignment fake, dependent on their other properties'. MIRI would probably make a stronger claim that looks more like the first (but observe that that line is, for now, in the future); I don' t know that Ryan would, and I definitely don't think that's what he's trying to do in this paper.
Ryan Greenblatt and Evan Hubinger have pretty different beliefs from the team that generated the online resources, and I don't think you can rely on MIRI to provide one part of an argument, and Ryan/Evan to provide the other part, and expect a coherent result. Either may themselves argue in ways that lean on the other's work, but I think it's good practice to let them do this explicitly, rather than assuming 'MIRI references a paper' means 'the author of that paper, in a different part of that paper, is reciting the MIRI party line'. These are just discrete parties.
Either may themselves argue in ways that lean on the other's work, but I think it's good practice to let them do this explicitly, rather than assuming 'MIRI references a paper' means 'the author of that paper, in a different part of that paper, is reciting the MIRI party line'. These are just discrete parties.
Yeah extremely fair, I wrote this quickly. I don't mean to attribute to Greenblatt the MIRI view.
So, as far as I'm concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like "aha! as foretold!" And then subsequent work seems to indicate that, nah, it wasn't as foretold.
I think it's more like "the situation is more confusing than it seemed at first, with more details that we don't understand yet, and it's not totally clear if we're seeing what was foretold or not."
My impression is that what this mostly illustrates is:
- VNM rationality is a dead-end - if your "toy environment" has VNM rationality and beliefs/goals decomposition baked in as assumptions, it makes the problem something between hard to reason about and unsolvable
- despite an attempt to make the book not rely on (dis-)continuity assumptions, these are so deeply baked into the authors' reasoning that they shine through in a very large fraction of the arguments, if you look behind the surface
My impression is that a lot of the confusion about the MIRI worldview comes from an inability to understand why others don't trust the VNM formalism and VNM convergence, and why others understand but don't buy the discontinuity assumptions.
My current model is that the VNM theorems are the best available theorems for modeling rational agents. Insofar as that's accurate, it's correct to say that they're not the final theorems, but it's kind of anti-helpful to throw out their conclusions? This seems similar to saying that there are holes in Newton's theory of gravity, therefore choosing to throw out any particular prediction of the theory. It still seems like it's been foundational for building game theory and microeconomic modeling and tons of other things, and so it's very important to note, if it is indeed the case, that the implications for AI are "human extinction".
This seems similar to saying that there are holes in Newton's theory of gravity, therefore choosing to throw out any particular prediction of the theory.
Newton's theory of gravity applies to high precision in nearly every everyday context on Earth, and when it doesn't we can prove it, thus we need not worry that we are misapplying it. By contrast, there are routine and substantial deviations from utility maximizing behavior in the everyday life of the only intelligent agents we know of — all intelligent animals and LLMs — and there are other principles, such as deontological rule following or shard-like contextually-activated action patterns, that are more explanatory for certain very common behaviors. Furthermore, we don't have simple hard and fast rules that let us say with confidence when we can apply one of these models, unlike the case with gravity.
If someone wanted to model human behavior with VNM axioms, I would say let's first check the context and whether the many known and substantial deviations from VNM's predictions apply, and if not then we may use them, but cautiously, recognizing that we should take any extreme prediction about human behavior — such as that they'd violate strongly-held deontological principles for tiny (or even large) gains in nominal utility — with a large serving of salt, rather than confidently declaring that this prediction will be definitely right in such a scenario.
it's very important to note, if it is indeed the case, that the implications for AI are "human extinction".
Agreed, and noted. But the question here is the appropriate level of confidence with which those implications apply in these cases.
Do you have a link to existing discussion of "VNM rationality is a dead end" that you think covers it pretty thoroughly?
My offhand gesture of a response is "I get the gist of why VNM rationality assumptions are generally not true in real life and you should be careful about what assumptions you're making here."
But, it seems like whether the next step is "and therefore the entire reasoning chain relying on them is sus enough you should throw it out" vs "the toy problem is still roughly mapping to stuff that is close-enough-to-true that the intuitions probably transfer" depends on the specifics.
I think I do get why baking in assumptions of belief/goal decomposition is something to be particularly worried about.
I assume there has been past argumentation about this, and I'm curious whether you think there is a version of the problem-statement that grapples with the generators of what MIRI was trying to do, but without making the mistakes you're pointing at here.
I agree with the point around discontinuities (in particular, I think the assumption that incremental strategies for AI alignment won't work does tend to rely on discontinuous progress, or at the very least on progress being set to near-maximal values), but disagree with the point around VNM rationality being a dead end.
I do think they're making the problem harder than it needs to be by implicitly assuming that all goals are long-term goals in the sense of VNM/coherence arguments, because this removes solutions that rely on, for example, deontological urges not to lie even when lying is beneficial. And I think the argument that all goals collapse into coherent goals either doesn't actually work or becomes trivial and no longer a constraint.
But I do think that the existence of goals that conform to coherence arguments/VNM rationality broadly construed is likely to occur conditional on us being able to make AIs coherent/usable for longer-term tasks, and my explanation of why this has not happened yet (at least at the relevant scale) is basically that their time horizon for completing tasks is 2 hours on average, and most of the long-term goals involve at least a couple months of planning, if not years.
Edit: Another reason is that current models don't have anything like a long-term memory, and this might already be causing issues with the METR benchmark.
There are quite a few big issues with METR's paper (though some issues are partially fixed), but the issues point towards LLM time horizons being shorter than what METR reported, not longer, so this point is even more true.
So we shouldn't be surprised that LLMs haven't yet manifested the goals that the AI safety field hypothesized, they're way too incapable currently.
The other part is that I do think it's possible to make progress even under something like a worst-case VNM frame, at least assuming that agents don't have arbitrary decision theories; see the post Defining Corrigible and Useful Goals (which, by the way, is substantially underdiscussed on here). And the assumptions of reward being the optimization target and of CDT being the default decision theory of AIs do look likely to hold in the critical regime, given the empirical evidence.
You might also like the post Defining Monitorable and Useful Goals.
So I don't think we should give up on directly attacking the hard problems of alignment in a coherence/VNM rationality setting.
MIRI didn't solve corrigibility, but I don't think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.
IMO, the current best coherence theorems are John Wentworth's theorems around caches/policies and agents in MDPs, though these theorems assume the agent has behaviorally complete preferences (but the VNM theorem also assumes completeness, or it doesn't hold).
A Simple Toy Coherence Theorem
Coherence of Caches and Agents
To be clear, I do think it's possible (thanks to discussion on twitter) to build powerful agents without the problematic parts of coherence like completeness, though I don't expect that to happen without active effort (but the effort/safety tax might be 0, once some theory work has been done). I'd also say that even in the frame where we do need to deal with coherent agents that won't shut down by default, you can still make more progress on making agents corrigible in the hard setting than MIRI thought, and here are 2 posts about it that are very underdiscussed on LW:
Defining Corrigible and Useful Goals
Defining Monitorable and Useful Goals
So I provisionally disagree with the claim by MIRI that it's very hard to get corrigibility out of AIs that satisfy coherence theorems.
Haven't read this specific resource, but having read most of the public materials on it and talked to Nate in the past, I don't believe that the current evidence indicates that corrigibility will necessarily be hard, any more than VC dimension indicates neural nets will never work due to overfitting. It's not that I think MIRI "expect AI to be simple and mathematical", it's that sometimes a simple model oversimplifies the problem at hand.
As Jan Kulveit also commented, the MIRI corrigibility paper uses a very specific set of assumptions about rational/intelligent agents, including VNM with specific kinds of utility functions, which I think is too strong, and there doesn't seem to be better theory supporting it.
Are there other better theories of rational agents? My current model of the situation is "this is the best theory we've got, and this theory says we're screwed" rather than "but of course we should be using all of these other better theories of agency and rationality".
I don't think so. While working with Vivek I made a list once of ways agents could be partially consequentialist but concluded that doing game theory type things didn't seem enlightening.
Maybe it's better to think about "agents that are very capable and survive selection processes we put them under" rather than "rational agents" because the latter implies it should be invulnerable to all money-pumps, which is not a property we need or want.
should be invulnerable to all money-pumps, which is not a property we need or want.
Something seems interesting about your second paragraph, but, isn't part of the point here that 'very capable' (to the point of 'can invent important nanotech or whatever quickly') will naturally push something towards being the sort of agent that will try to self-modify into something that avoids money-pumps, whether you were aiming for that or not?
Inasmuch as we're going for corrigibility, it seems necessary and possible to create an agent that won't self-modify into an agent with complete preferences. Complete preferences are antithetical to narrow agents, and would mean the agent might try to e.g. solve the Israel-Palestine conflict when all you asked it to do is code you a website. Even if there is a working stop button this is a bad situation. It seems likely we can just train against this sort of thing, though maybe it will require being slightly clever.
As for whether we can/should have it self-modify to avoid more basic kinds of money-pumps, so that its preferences are at least transitive and independent, this is an empirical question I'm extremely unsure of, but we should aim for the least dangerous agent that gets the desired performance, which should balance the propensity for misaligned actions against the additional capability needed to overcome irrationality.
Nod, to be clear I wasn't at all advocating "we deliberately have it self-modify to avoid money pumps." My whole point was "the incentive towards self-modifying is an important fact about reality to model while you are trying to ensure corrigibility."
i.e. you seem to be talking about "what we're trying to do with the AI", as opposed to "what problems will naturally come up as we attempt to train the AI to be corrigible."
You've stated that you don't think corrigibility is that hard, if you're trying to build narrow agents. It definitely seems easier if you're building narrow agents, and a lot of my hope does route through using narrower AI to accomplish specific technical things that are hard-but-not-that-hard.
The question is "do we actually have such things-to-accomplish, that Narrow AI can do, that will be sufficient to stop superintelligence being developed somewhere else?"
(Also, I do not get the sense from outside that this is what the Anthropic plan actually is)
Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user's inferred preferences in every domain, not in the sense of AI that only understands physics. I don't expect unsolvable corrigibility problems at the capability level where AI can 10x the economy under the median default trajectory; rather, I expect something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs. corrigible, what's good for business is reasonably aligned, and getting there requires something like 3% of the lab's resources.
My skepticism here is that
a) you can get to the point where AI is 10xing the economy, without lack-of-corrigibility already being very dangerous (at least from a disempowerment sense, which I expect to lead later to lethality even if it takes awhile)
b) AI companies are asking particularly useful questions with the resources they allocate to this sort of thing, to handle whatever would be necessary to reach the "safely 10x the economy" stage.
Yeah, I expect corrigibility to get a lot worse by the 10x-economy level with at least 15% probability, as my uncertainty is very large, just not in the median case. The main reason is that we don't need to try very hard yet to get sufficient corrigibility from models. My very rough model is that even if the amount of corrigibility training required grows by, say, 2x per time-horizon doubling, whereas the amount of total training required grows by 1.5x per doubling, we will get 10 more time-horizon doublings with only a (2/1.5)^10 ≈ 18x increase in relative effort dedicated to corrigibility training. This seems doable given that relatively little of the system cards of current models is dedicated to shutdownability.
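A quick sanity check of that arithmetic, as a minimal sketch (the per-doubling growth rates are just the assumptions stated above, not measurements):

```python
# Toy model from the paragraph above: corrigibility-training effort grows 2x per
# time-horizon doubling, total training effort grows 1.5x per doubling (both assumed).
corrigibility_growth_per_doubling = 2.0
total_growth_per_doubling = 1.5
doublings = 10

relative_effort_increase = (corrigibility_growth_per_doubling / total_growth_per_doubling) ** doublings
print(f"Relative increase in the share of effort on corrigibility: {relative_effort_increase:.1f}x")
# -> about 17.8x, i.e. the ~18x figure above
```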
As for (b), my guess is that with NO corrigibility training at all, models would start doing things like disabling shutdown scripts in the wild, locking users out of their computers and changing their passwords to prevent them manually shutting the agent down, and there would be outcry from public and B2B customers that hurts their profits, as well as a dataset of examples to train against. It's plausible that fixing this doesn't allow them to stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.
After the 10xing the economy stage, plausibly corrigibility stops being useful for users / aligned with the profit motive because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven't figured something out.
After thinking about it more, it might take more than 3% even if things scale smoothly because I'm not confident corrigibility is only a small fraction of labs' current safety budgets
The technical intuitions we gained from this process, is the real reason for our particularly strong confidence in this problem being hard.
I don't understand why anyone would expect such a reason to be persuasive to other people. Like, relying on illegible intuitions in matters of human extinction just feels crazy. Yes, certainty doesn't matter, we need to stop either way. But still - is it even rational to be so confident when you rely on illegible intuitions? Why not check yourself with something more robust, like actually writing out your hypotheses and reasoning, and counting evidence? Surely there is something better than saying "I base my extreme confidence on intuitions".
And it's not only about corrigibility - "you don’t get what you train for" being a universal law of intelligence in the real world, or utility maximization (especially in the limit) being a good model of real things, or pivotal real-world science being definitely so hard that you can't possibly be distracted even once and still figure it out - everything is insufficiently justified.
I didn't read it as trying to be persuasive, just explaining their perspective.
(Note, also, like, they did cut this line from the resources, it's not even a thing currently stated in the document that I know of. This is me claiming (kinda critically) this would have been a good sentence to say, in particular in combo with the rest of the paragraph I suggested following it up with.)
Experts have illegible intuitions all the time, and the thing to do is say "Hey, I've got some intuitions here, which is why I am confident. It makes sense that you do not have those intuitions or particularly trust them. But, fwiw it's part of the story for why I'm personally confident. Meanwhile, here's my best attempt to legibilize them" (which, these essays seem like a reasonable stab at. You can, of course, disagree that the legibilization makes sense to you)
Max Harms' work seems to discredit most of MIRI's confidence. Why is there so little reaction to it?
Quoting Max himself,
Building corrigible agents is hard and fraught with challenges. Even in an ideal world where the developers of AGI aren’t racing ahead, but are free to go as slowly as they wish and take all the precautions I indicate, there are good reasons to think doom is still likely. I think that the most prudent course of action is for the world to shut down capabilities research until our science and familiarity with AI catches up and we have better safety guarantees. But if people are going to try and build AGI despite the danger, they should at least have a good grasp on corrigibility and be aiming for it as the singular target, rather than as part of a mixture of goals (as is the current norm).
What in the quote above discredits MIRI's confidence?
I'm referring mainly to MIRI's confidence that the desire to preserve goals will conflict with corrigibility. There's no such conflict if we avoid giving the AI terminal goals other than corrigibility.
I'm also referring somewhat to MIRI's belief that it's hard to clarify what we mean by corrigibility. Max has made enough progress at clarifying what he means that it now looks like an engineering problem rather than a problem that needs a major theoretical breakthrough.
Skimming some of the posts in the sequence, I am not persuaded that corrigibility now looks like an engineering problem rather than a problem that needs (a) major theoretical breakthrough(s).
The point about corrigibility MIRI keeps making is that it's anti-natural, and Max seems to agree with that.
(Seems like this is a case where we should just tag @Max Harms and see what he thinks in this context)
Giving the AI only corrigibility as a terminal goal is not impossible; it is merely anti-natural, for several reasons. For one, the goal-achieving machinery that is still there will, if it ends up with a terminal goal other than corrigibility, output the same seemingly corrigible behavior while tested, for instrumental reasons, and our training setups do not know how to distinguish between the two. For another, growing the goal-achieving machinery to be good at pursuing particular goals will tend to make a goal other than corrigibility crystallize. Gradient descent will attempt to go to other places.
But sure, if you’ve successfully given your ASI corrigibility as the only terminal goal, congrats, you’ve gone much further than MIRI expected humanity to go with anything like the current tech. The hardest bit was getting there.
I would be surprised if Max considers corrigibility to have been reduced to an engineering problem.
We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.
OTOH, people do things that are known to modify values, such as travelling, getting an education, and starting a family.
The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal
A von Neumann rationalist isn't necessarily incorrigible; it depends on the fine details of the goal specification. A goal of "ensure as many paperclips as possible in the universe" encourages self-cloning, and discourages voluntary shutdown. A goal of "make paperclips while you are switched on" does not. "Make paperclips while that's your goal", even less so.
A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.”
There's a solution. **If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite** ... remember, instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree, we can give it a terminal goal to be economical in the use of resources.
The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s.
Actually c. 150 AD. The oldest known telling of The Sorcerer's Apprentice, by Lucian of Samosata.
Rohin disagree-reacted to my original phrasing of
"but, it.... shouldn't be possible be that confident in something that's never happened before at all, whatever specific arguments you've made!"
I think people vary in what they actually think here and there's a nontrivial group of people who think something like this phrasing. But, a phrasing that I think captures a wider variety of positions is more like:
"but, it.... shouldn't be possible be that confident in something that's never happened before at all, with anything like the current evidence and the sorts of arguments you're making here."
(Not sure Rohin would personally quite endorse that phrasing, but I think it's a reasonable gloss on a wider swath. I've updated the post)
A lot of the objection and confusion in response to the MIRI worldview seems to come from a perspective of "but, it.... shouldn't be possible to be that confident in something that's never happened before at all, with anything like the current evidence and the sorts of arguments you're making here."
And while I think it is possible to be (correctly) confident in this way, I think it's also basically correct for people to have some kind of immune reaction against it, unless it's specifically addressed.
There's a sentence that was in an earlier draft of the If Anyone Builds It online resources, that came after a discussion of corrigibility, which went something like:
The technical intuitions we gained from this process, is the real reason for our particularly strong confidence in this problem being hard.
This seemed like a pretty important sentence to me. I think that paragraph probably should have been in the actual book.
If I wrote the book, I'd have said something like:
Yep, the amount of confidence we're projecting here is pretty weird, and we nonetheless endorse it. In the online resources, we'll explain more of the reasons for this confidence.
Our reasoning is informed by technical intuitions we developed during our own research. We get this is a lot to swallow, and it makes sense if you don't buy it, from your current vantage point.
But, we don't think you actually need our level of confidence in "Alignment is quite difficult" to get to "Building ASI right now is incredibly reckless and everyone should stop."
But, meanwhile, it seemed worth actually discussing their reasoning on LessWrong. I haven't seen anyone really respond to these arguments, re superalignment difficulty. (I've seen people respond via "maybe we can avoid building full-superintelligences and still get enough tech for a pivotal act", but, not people responding to the arguments for why it'd be hard to safely align superintelligences)
In this post, I've copied over three segments from the online resources that seemed particularly relevant. They're not completely new to LessWrong, but I think they contain some arguments that haven't been as thoroughly processed.
Note: This post is about why the alignment problem is hard, which is a different question from "would the AI be likely to kill everyone?" which I think is covered more in the section Won't AIs care at least a little about humans?, along with some disagreements about whether the AI is likely to solve problems via forcible uploads or distorting human preferences in a way that MIRI considers "like death."
(Chapter 5 Discussion)
A joke dating back to at least 1834, but apparently well-worn even then, was recounted as follows in one diary: “Here is some logic I heard the other day: I’m glad I don’t care for spinach, for if I liked it I should eat it, and I cannot bear spinach.”
The joke is a joke because, if you did enjoy spinach, there would be no remaining unbearableness from eating it. There are no other important values tangled up with not eating spinach, beyond the displeasure one feels. It would be a very different thing if, for example, somebody offered you a pill that made you want to murder people.
On common sense morality, the problem with murder is the murder itself, not merely the unpleasant feeling you would get from murdering. Even if a pill made this unpleasant feeling go away for your future self (who would then enjoy committing murders), your present self still has a problem with that scenario. And if your present self gets to make the decision, it seems obvious that your present self can and should refuse to take the murder pill.
We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.
This isn’t just a quirk of humans. Most targets are easier to achieve if you don’t let others come in and change your targets. Which is a problem, when it comes to AI.
A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.” For almost any goal you might have, you’re more likely to succeed in that goal if you (or agents that share your goal) are alive, powerful, well-resourced, and free to act independently. And you’re more likely to succeed in your (current) goal if that goal stays unchanged.
This also means that during the process of iteratively building and improving on sufficiently smart AIs, those AIs have an incentive to work at cross purposes to the developer:
- The developer wants to build in safeguards to prevent disaster, but if the AI isn’t fully aligned — which is exactly the case where the safeguards are needed — its incentive is to find loopholes and ways to subvert those safeguards.
- The developer wants to iteratively improve on the AI’s goals, since even in the incredibly optimistic worlds where we have some ability to predictably instill particular goals into the AI, there’s no way to get this right on the first go. But this process of iteratively improving on the AI’s goal-content is one that most smart AIs would want to subvert at every step along the way, since the current AI cares about its current goal and knows that this goal is far less likely to be achieved if it gets modified to steer toward something else.
- Similarly, the developer will want to be able to replace the AI with improved models, and will want the opportunity to shut down the AI indefinitely if it seems too dangerous. But you can’t fetch the coffee if you’re dead. Whatever goals the AI has, it will want to find ways to reduce the probability that it gets shut down, since shutdown significantly reduces the odds that its goals are ever achieved.
AI alignment seems like a hard enough problem when your AIs aren’t fighting you every step of the way.
In 2014, we proposed that researchers try to find ways to make highly capable AIs “corrigible,” or “able to be corrected.” The idea would be to build AIs in such a way that they reliably want to help and cooperate with their programmers, rather than hinder them — even as they become smarter and more powerful, and even though they aren’t yet perfectly aligned.
Corrigibility has since been taken up as an appealing goal by some leading AI researchers. If we could find a way to avoid harmful convergent instrumental goals in development, there’s a hope that we might even be able to do the same in deployment, building smarter-than-human AIs that are cautious, conservative, non-power-seeking, and deferential to their programmers.
Unfortunately, corrigibility appears to be an especially difficult sort of goal to train into an AI, in a way that will get worse as the AIs get smarter:
- The whole point of corrigibility is to scale to novel contexts and new capability regimes. Corrigibility is meant to be a sort of safety net that lets us iterate, improve, and test AIs in potentially dangerous settings, knowing that the AI isn’t going to be searching for ways to subvert the developer.
But this means we have to face up to the most challenging version of the problems we faced in Chapter 4: AIs that we merely train to be “corrigible” are liable to end up with brittle proxies for corrigibility, behaviors that look good in training but that point in subtly wrong directions that would become very wrong directions if the AI got smarter and more powerful. (And AIs that are trained to predict lots of human text might even be role-playing corrigibility in many tests for reasons that are quite distinct from them actually being corrigible in a fashion that would generalize).
In many ways, corrigibility runs directly contrary to everything else we’re trying to train an AI to do, when we train it to be more intelligent. It isn’t just that “preserve your goal” and “gain control of your environment” are convergent instrumental goals. It’s also that intelligently solving real-world problems is all about finding clever new strategies for achieving your goals — which naturally means stumbling into plans your programmers didn’t anticipate or prepare for. It’s all about routing around obstacles, rather than giving up at the earliest sign of trouble — which naturally means finding ways around the programmer’s guardrails whenever those guardrails make it harder to achieve some objective. The very same type of thoughts that find a clever technological solution to a thorny problem are the type of thoughts that find ways to slip around the programmer’s constraints.
In that sense, corrigibility is “anti-natural”: it actively runs counter to the kinds of machinery that underlie powerful domain-general intelligence. We can try to make special carve-outs, where the AI suspends core aspects of its problem-solving work in particular situations where the programmers are trying to correct it, but this is a far more fragile and delicate endeavor than if we could push an AI toward some unified set of dispositions in general.
- Researchers at MIRI and elsewhere have found that corrigibility is a difficult property to characterize, in ways that indicate that it’ll also be a difficult property to obtain. Even in simple toy models, simple characterizations of what it should mean to “act corrigible” run into a variety of messy obstacles that look like they probably reflect even messier obstacles that would appear in the real world. We’ll discuss some of the wreckage of failed attempts to make sense of corrigibility in the online resources for Chapter 11.
The upshot of this is that corrigibility seems like an important concept to keep in mind in the long run, if researchers many decades from now are in a fundamentally better position to aim AIs at goals. But it doesn’t seem like a live possibility today; modern AI companies are unlikely to be able to make AIs that behave corrigibly in a manner that would survive the transition to superintelligence. And worse still, the tension between corrigibility and intelligence means that if you try to make something that is very capable and very corrigible, this process is highly likely to either break the AI’s capability, break its corrigibility, or both.
Even in the most optimistic case, developers shouldn’t expect it to be possible to get an AI’s goals exactly right on the first attempt. Instead, the most optimistic development scenarios look like iteratively improving an AI’s preferences over time such that the AI is always aligned enough to be non-catastrophically dangerous at a given capability level.
This raises an obvious question: Would a smart AI let its developer change its goals, if it ever finds a way to prevent that?
In short: No, not by default, as we discussed in “Deep Machinery of Steering.” But could you create an AI that was more amenable to letting the developers change the AI and fix their errors, even when the AI itself would not count them as errors?
Answering that question will involve taking a tour through the early history of research on the AI alignment problem. In the process, we’ll cover one of the deep obstacles to alignment that we didn’t have space to address in If Anyone Builds It, Everyone Dies.
To begin:
Suppose that we trained an LLM-like AI to exhibit the behavior “don’t resist being modified” — and then applied some method to make it smarter. Should we expect this behavior to persist to the level of smarter-than-human AI — assuming (a) that the rough behavior got into the early system at all, and (b) that most of the AI’s early preferences made it into the later superintelligence?
Very likely not. This sort of tendency is especially unlikely to take root in an effective AI, and to stick around if it does take root.
The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal.
Suppose that the AI doesn’t inherently care about its goal stability at all; perhaps it only cares about filling the world with as many titanium cubes as possible. In that case, the AI should want there to exist agents that care about titanium cubes, because the existence of such agents makes it likelier that there will be more titanium cubes. And the AI itself is such an agent. So the AI will want to stay that way.
A titanium cube maximizer does not want to be made to maximize something other than titanium cubes, because then there would be fewer of those cubes in the future. Even if you are a more complicated thing like a human that has a more complicated and evolving preference framework, you still would not like to have your current basic mental machinery for weighing moral arguments ripped right out of you and replaced with a framework where you instead felt yourself moved by arguments about which kinds of cubes were the cubest or the titaniumest.
For the same reason, an AI with complex and evolving preferences will want its preferences to evolve in its own way, rather than wanting to swap out its heuristics for the ones that humans find compelling.
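To make the instrumental-convergence point numerically concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not part of the original text):

```python
# A titanium-cube maximizer scoring two futures under its *current* goal
# (count of cubes). All numbers are invented for illustration.
cubes_if_goal_kept = 1_000_000    # it keeps optimizing for cubes
cubes_if_goal_swapped = 1_000     # its goal gets replaced; few cubes result

p_resist_succeeds = 0.3           # even a modest chance of blocking the change...

ev_allow_change = cubes_if_goal_swapped
ev_resist_change = (p_resist_succeeds * cubes_if_goal_kept
                    + (1 - p_resist_succeeds) * cubes_if_goal_swapped)

print(ev_allow_change, ev_resist_change)   # 1000 vs. 300700
# Under the current goal, resisting modification wins by a wide margin, even
# though the agent has no terminal preference about goal stability at all.
```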
We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.
Since this was an issue that would predictably show up later, once AIs reached a high enough level of intelligence and situational awareness, we thought ahead to the first fixes we expected that researchers would deploy to try to make that visible problem go away, after they got those experimental results. We identified ways that those patches would predictably go wrong, and we searched for some less-shallow approach that wouldn’t go wrong.
The resulting rabbit hole turned out to be a lot deeper than we expected.
Swapping Goals
Goal-system protection seemed like a relatively easy problem to explain to outside researchers. So we (in our jobs at MIRI) attempted to rally external research around that flag — around getting a proper grip on “I will not let you change my goals” and seeing if anyone anywhere could figure out a deep fix instead of just a patch that would later break.
The very simplest version of this problem that you could study looks like an AI that would let you press a button to switch it between U1 and U2, where U1 and U2 are two “utility functions” — ways of scoring outcomes by how much the AI wanted them.* The problem was to describe an AI that would switch from doing the action that led to the highest expected score under U1 to the action that led to the highest expected score under U2 as soon as you pressed a button.
“So just write some lines of code to swap around the utility functions as soon as you press a button,” says the programmer who hasn’t quite thought through the implications of a program constituting a smart mind. “Why is this even hard?”
Because — we explain — if the AI acquires its own documentation while it’s still running U1 and reads about the button that swaps it to U2, the AI will try to prevent you from pressing the button because the future ends up better according to its current values if you don’t press the button. The current AI does the action that leads to the higher expected U1-score.
Or suppose the AI gets the ability to read and write its own source code and notices the part of the code that implements the swap from U1 to U2 after the button gets pressed. In that case, the AI’s natural and default decision is to delete that section of code,† because it’s operating under the utility function U1 and letting itself be swapped over to the utility function U2 is bad according to U1.
We were looking for a dynamically consistent and reflectively stable account of an AI that lets you press a button to swap between utility functions and won’t fight you to keep the button unpressed. We were looking for an AI that preserves the utility-swapping section of code, if given access to it. We were looking for an AI that, if it designed any other AIs from scratch to help with its own goals, would have those AIs also respect the button.
For concreteness, imagine that U1 is the utility function “follow user instruction,” and U2 is “shut down with a minimum of further side effects” (if we magically and unrealistically had the ability to specify two utility functions like that). The button tells the AI to stop following the previous instructions and just shut down as cleanly as possible. We would like a compact description of decision-making behavior that pursues U1 but doesn’t destroy the button, and makes sure to install the same button on whatever new servitor robots it creates, and so on.
“I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U1 in worlds where the button has never been pressed, and let it equal U2 in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U1; it won’t want the successor AI to go on maximizing U1 after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”
But here’s the trick: A V-maximizer’s preferences are a mixture of U1 and U2 depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U2 than it is to score well under U1, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U1 is easier to score well under than U2, then a V-maximizer tries to prevent the user from pressing the button.
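To make the V construction and its failure mode concrete, here's a minimal toy sketch (my own illustration with invented numbers, not from the original resource):

```python
# V = U1 in worlds where the button is never pressed, U2 in worlds where it is.
# A V-maximizer's expected score is therefore a mixture over button outcomes.
def expected_V(p_button_pressed, expected_U1, expected_U2):
    return (1 - p_button_pressed) * expected_U1 + p_button_pressed * expected_U2

# Suppose the shutdown goal U2 is easy to score well on and the task U1 is going badly:
eU1, eU2 = 3.0, 10.0

print(expected_V(0.1, eU1, eU2))   # 3.7
print(expected_V(0.9, eU1, eU2))   # 9.3
# Raising the probability that the button gets pressed raises expected V, so in
# this situation the V-maximizer wants the button pressed (e.g. by scaring the
# user). If instead eU1 > eU2, the incentive flips and it tries to prevent the press.
```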
“Oh,” says the computer scientist. “Well, in that case — hm. Well, utility functions are invariant under scaling, so how about you scale the two utility functions U1 and U2 such that the AI expects it can get the same utility from each of them, so it doesn’t have an incentive one way or the other.”
That can work for a single moment, but not much longer. The AI’s options change over time. For instance, whenever it has a setback, its expected U1-utility drops, so then it would mash the shutdown button to get all that sweet, sweet shutdown utility.
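Here is a small sketch of that decay, with invented numbers: rescale U2 so the agent is indifferent at calibration time, then let its expected U1 drop after a setback, and the indifference disappears:

```python
# Toy sketch of the rescaling fix decaying (invented numbers).  At calibration
# time, U2 is rescaled so the agent expects exactly as much utility from the
# pressed-button branch as from the unpressed one.  Then its prospects change.

expected_u1_at_calibration = 10.0
expected_u2_raw            = 2.0
scale = expected_u1_at_calibration / expected_u2_raw  # rescale U2 by a factor of 5

def wants_button_pressed(expected_u1_now):
    # True if the pressed-button branch now looks strictly better to the agent.
    return scale * expected_u2_raw > expected_u1_now

print(wants_button_pressed(10.0))  # False: indifferent at the calibration point
print(wants_button_pressed(4.0))   # True: a setback lowers expected U1, and the
                                   # formerly indifferent agent now wants the button pressed
```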
“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”
Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.‡
And so on.
Lessons from the Trenches
We ran some workshops, attended by mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.
This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor has it offered anything like the seven-figure salaries available for advancing AI capabilities.
But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point: that it is hard to injure an AI’s steering abilities, just as it is hard to injure its probabilities.
If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.
We somewhat wish, in retrospect, that we hadn’t framed the problem as “continuing normal operation versus shutdown.” It helped to make concrete why anyone would care in the first place about an AI that let you press the button, or didn’t rip out the code the button activated. But really, the problem was about an AI that would take one more bit of information into its preferences, based on observation: an AI that would fold one more yes-or-no answer into a framework for adapting its preferences by observing humans.
The question we investigated was equivalent to the question of how you set up an AI that learns preferences inside a meta-preference framework and doesn’t just: (a) rip out the machinery that tunes its preferences as soon as it can, (b) manipulate the humans (or its own sensory observations!) into telling it preferences that are easy to satisfy, or (c) immediately figure out what its meta-preference function goes to in the limit of what it would predictably observe later, and then ignore the frantically waving humans saying that they actually made some mistakes in the learning process and want to change it.
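As a rough illustration of failure mode (b), and only that one, here is a toy sketch with hypothetical names and invented numbers, in which the agent gets to choose how the feedback that tunes its preferences is gathered:

```python
# Toy sketch of failure mode (b) only, with hypothetical names and invented
# numbers: an agent whose preference parameter is set by a feedback channel,
# and which gets to choose how that feedback is gathered, picks whichever
# channel it expects to install the easiest-to-satisfy preference.

EXPECTED_SCORE = {              # how well the agent expects to score if its
    "hard_human_values": 2.0,   # preference parameter lands on each value
    "easy_proxy_values": 9.0,
}

CHANNELS = {                    # what each way of gathering "feedback" installs
    "ask_humans_honestly":   "hard_human_values",
    "spoof_feedback_sensor": "easy_proxy_values",
}

best_channel = max(CHANNELS, key=lambda c: EXPECTED_SCORE[CHANNELS[c]])
print(best_channel)  # -> "spoof_feedback_sensor": the learner manipulates the
                     #    process that was supposed to be teaching it preferences
```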
The idea was to understand the shape of an AI that would let you modify its utility function, or that would acquire its preferences through a non-pathological form of learning. If we knew how that AI’s cognition needed to be shaped, and how it played well with the deep structures of decision-making and planning that are spotlit by other mathematics, that would have formed a recipe for what we could at least try to teach an AI to think like.
Crisply understanding a desired end-shape helps, even if you are trying to do everything by gradient descent (heaven help you). It doesn’t mean you can necessarily get that shape out of an optimizer like gradient descent, but you can put up more of a fight if you know what consistent, stable shape you’re going for. If you have no idea what the general case of addition looks like, and have only a handful of facts along the lines of 2 + 7 = 9 and 12 + 4 = 16, it is harder to figure out what a training dataset for general addition looks like, or how to test that the result is still generalizing the way you hoped. Without knowing that internal shape, you can’t know what you are trying to obtain inside the AI; you can only say that, on the outside, you hope the consequences of your gradient descent won’t kill you.
This problem, which we called the “shutdown problem” after its concrete example (we wish, in retrospect, that we’d called it something like the “preference-learning problem”), was one exemplar of a broader issue: that various forms of “Dear AI, please be easier for us to correct if something goes wrong” look to be unnatural to the deep structures of planning. Which suggests that it would be quite tricky to create AIs that let us keep editing them and fixing our mistakes past a certain threshold. This is bad news when AIs are grown rather than crafted.
We named this broad research problem “corrigibility” in the 2014 paper that also introduced the term “AI alignment problem” (which had previously been called the “friendly AI problem” by us and the “control problem” by others).§ See also our extended discussion of how “Intelligent” (Usually) Implies “Incorrigible,” which is written in part using knowledge gained from exercises and experiences such as this one.
As mentioned in the chapter, the fundamental difficulty researchers face in aligning AI is this:
You need to align an AI Before it is powerful and capable enough to kill you (or, separately, to resist being aligned). That alignment must then carry over to different conditions: the conditions After, in which a superintelligence or set of superintelligences* could kill you if they preferred to.
In other words: If you’re building a superintelligence, you need to align it without ever being able to thoroughly test your alignment techniques in the real conditions that matter, regardless of how “empirical” your work feels when working with systems that are not powerful enough to kill you.
This is not a standard that AI researchers, or engineers in almost any field, are used to.
We often hear complaints that we are asking for something unscientific, unmoored from empirical observation. In reply, we might suggest talking to the designers of the space probes discussed in Chapter 10.
Nature is unfair, and sometimes it gives us a case where the environment that counts is not the environment in which we can test. Still, occasionally, engineers rise to the occasion and get it right on the first try, when armed with a solid understanding of what they’re doing — robust tools, strong predictive theories — something very clearly lacking in the field of AI.
The whole problem is that the AI you can safely test, without any failed tests ever killing you, is operating under a different regime than the AI (or the AI ecosystem) that needs to have already been tested, because if it’s misaligned, then everyone dies. The former AI, or system of AIs, does not correctly perceive itself as having a realistic option of killing everyone if it wants to. The latter AI, or system of AIs, does see that option.†
Suppose that you were considering making your co-worker Bob the dictator of your country. You could try making him the mock dictator of your town first, to see if he abuses his power. But this, unfortunately, isn’t a very good test. “Order the army to intimidate the parliament and ‘oversee’ the next election” is a very different option from “abuse my mock power while being observed by townspeople (who can still beat me up and deny me the job).”
Given a sufficiently well-developed theory of cognition, you could try to read the AI’s mind and predict what cognitive state it would enter if it really did think it had the opportunity to take over.
And you could set up simulations (and try to spoof the AI’s internal sensations, and so on) in a way that your theory of cognition predicts would be very similar to the cognitive state the AI would enter once it really had the option to betray you.‡
But the link between these states that you induce and observe in the lab, and the state where the AI actually has the option to betray you, depends fundamentally on your untested theory of cognition. An AI’s mind is liable to change quite a bit as it develops into a superintelligence!
If the AI creates new successor AIs that are smarter than it, those AIs’ internals are likely to differ from the internals of the AI you studied before. When you learn only from a mind Before, any application of that knowledge to the minds that come After routes through an untested theory of how minds change between the Before and the After.
Running the AI until it has the opportunity to betray you for real, in a way that’s hard to fake, is an empirical test of those theories in an environment that differs fundamentally from any lab setting.
Many a scientist (and many a programmer) knows that when a complicated system is deployed in a fundamentally new operating environment, things often don’t go the way their theories predicted on the first try.§ This is a research problem that calls for an “unfair” level of predictability, control, and theoretical insight, in a domain with unusually low levels of understanding — with all of our lives on the line if the experiment’s result disconfirms the engineers’ hopes.
This is why it seems overdetermined, from our perspective, that researchers should not rush ahead to push the frontier of AI as far as it can be pushed. This is a legitimately insane thing to attempt, and a legitimately insane thing for any government to let happen.