If we use Myopic Optimization from Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.
I'd make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past, let's call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small so I don't expect it to pay out. Maybe you're imagining that T will increase a bunch?
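To make the point concrete, here's a toy illustration (a sketch with made-up numbers, not anything from the post): if the reward is computed from at most the last T steps, a sneaky action whose payoff lands after T contributes nothing to the return the learning algorithm sees, so it never gets reinforced.

```python
def observed_return(rewards, T):
    """The return as seen by training: only rewards within the time bound T count."""
    return sum(rewards[:T])

# A sneaky power-seeking action whose payoff arrives at step 500:
rewards = [0.0] * 1000
rewards[500] = 10.0

print(observed_return(rewards, T=100))   # 0.0  -> not reinforced (T small, as today)
print(observed_return(rewards, T=1000))  # 10.0 -> reinforced, if T grows a lot
```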
Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.
Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like "this might happen" rather than "this is doomed").
3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.
Re: (1), the AI assistants don't need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than "perfect" (while still being quite a high bar). In general I feel like you're implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
There are lots of ways this could go wrong, I'm not claiming it will clearly work, but I don't think either of your objections matters much.
whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past, let's call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small so I don't expect it to pay out. Maybe you're imagining that T will increase a bunch?
Yup! If we set aside RLVR (which I don't think I was making as strong statements about), and focus on "brain-like AGI" (some yet-to-be-invented version of model-based RL), then I feel strongly that this is an area where the existing RL literature is disanalogous to future AGI. People can envision a plan that takes days, months, years, or decades, and what consequences that plan will have. And those consequences can seem good or bad, and accordingly the person will pursue the plan, or not. There isn't really any T that connects the planning horizon to the RL training history. My 10-year-old kid has a 30-year career plan in mind that he's pursuing, longer than he's been alive. He insists he's not gonna change his mind, lol :-P
if this is your main argument I feel like your conclusion should be way more measured (more like "this might happen" rather than "this is doomed").
I agree with that (hence “secondarily”) and just edited to make that clearer :)
Re: (1), the AI assistants don't need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than "perfect" (while still being quite a high bar). In general I feel like you're implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
I’m arguing for the narrow claim “AI would love to escape control if it were confident that it had an opportunity to do so without getting caught”, as an obvious member of a category of things that the AI will get sculpted to want and like. I’m not arguing that the AI will ever actually get such confidence—see the “honeypots for control” vs “honeypots for alignment” discussion in §3.1.1.
Maybe a good analogy is: a little kid will climb a hill, or a play structure, or a tree, or whatever, and he'll get a thrill from being high up. And then he'll say that he'd like to climb the biggest mountain in the world, or go to outer space. And he's being sincere—this is something he would like (other things equal). But that doesn't mean he can or will actually climb Everest or commandeer rocket ships. He's just a little kid. Even if he wanted to, it would be trivial to prevent him from doing so. What's hard is preventing him from wanting to do so. You'd have to follow him around every moment, preventing him from ever climbing a play structure or hill or tree or looking over a precipice etc.
Spelling out the analogy: the little kid is the AI; wanting to climb Everest or go to outer space is the AI wanting to escape control and amass power; physically stopping the kid from actually doing those things is easy while he's little (that's the control side); whereas preventing him from ever wanting to do them, which would require policing every hill, tree, and play structure he ever encounters, is the hard part (that's the alignment side I'm arguing about).
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
I don’t buy into that picture for other reasons, but you’re right that this is a valid counterargument to what I wrote. I’ll delete or rewrite, thanks. See Foom & Doom §2.6 for more on the “other reasons”.
UPDATE NEXT DAY: I rewrote §3.5 a bit, thanks again.
UPDATE 7/30: Re-rewrote §3.5.
Not Steven Byrnes, but one area where I differ on expectations for MONA is that I expect T to increase a bunch, insofar as AI companies succeed at making AI progress. A lot of the reason is that I think many of the most valuable tasks for AI implicitly rely on long-term memory / insane context lengths, so T could easily be 1 month, 1 year, 10 years, or more. Depending on what the world looks like by then, those are ranges where an AI would be able and willing to power-seek more than it can currently.
Note that this is not necessarily an argument that MONA will certainly fail. But one area where I've changed my mind: I now think a lot of AI tasks are unlikely to be easily broken down into smaller chunks, so we won't be able to avoid relying on long-horizon outcome-based RL.
Edit: Deleted the factored cognition comment.
Doesn't the same argument you make for behaviorist RL failing apply to any non-perfect non-behaviorist RL?
"Follow rules unless you can get away with it" seems to also be an apt description of the non-behaviorist setup's true reward rule. Getting away with it also applies to faking the internal signature of sincerity used for the non-behaviorist reward model, as well as evading the notice of external judges.
So we're still stuck hoping that the simpler generalization wins out and stays dominant even after the system thoroughly understands itself and probably knows it could evade whatever that internal signal is. This is essentially the problem of wireheading, which I regard as largely unsolved since reasonable-seeming opinions differ dramatically.
Using non-behaviorist RL still seems like an improvement on purely behavioral RL. But there's a lot left to understand, as I think you'd agree.
This thought hadn't occurred to me even after reading twice all the way through the Self-Dialogue longer version of this argument, so your work at refining the argument might've been critical in jogging it loose in my brain.
As I mentioned in the conclusion, I hope to write more in the near future about how (and if) this pessimistic argument breaks down for certain non-behaviorist reward functions.
But to be clear, the pessimistic argument also applies perfectly well to at least some non-behaviorist reward functions, e.g. curiosity drive. So I partly agree with you.
(See changelog at the bottom for some post-publication edits.)
I will argue that a large class of reward functions, which I call "behaviorist", and which includes almost every reward function in the RL and LLM literature, is doomed to eventually lead to AI that will "scheme"—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. "treacherous turn"). I'll mostly focus on "brain-like AGI" (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.[1]
The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”. I’ll argue that the AI will wind up with the latter motivation. The reward function will miss sufficiently sneaky misaligned behavior, and so the AI will come to feel like that kind of behavior is good, and this tendency will generalize in a very bad way.
What very bad way? Here’s my go-to example of a plausible failure mode: There’s an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.
I’ll make a brief argument for this kind of scheming in §2, but most of the article is organized around a series of eight optimistic counterarguments in §3—and why I don’t buy any of them.
For my regular readers: this post is basically a 5x-shortened version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).[2]
Maybe you’re thinking: what possible RL reward function is not behaviorist?? Well, non-behaviorist reward functions are pretty rare in the textbook RL literature, although they do exist—one example is “curiosity” / “novelty” rewards.[3] But I think they’re centrally important in the RL system built into our human brains. In particular, I think that innate drives related to human sociality, morality, norm-following, and self-image are not behaviorist, but rather involve rudimentary neural net interpretability techniques, serving as inputs to the RL reward function. See Neuroscience of human social instincts: a sketch for details, and Intro series §9.6 for a more explicit discussion of why interpretability is involved.
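To make "non-behaviorist" a bit more concrete, here's a minimal sketch of a curiosity-style reward (my own illustration, not any particular published method): the bonus is computed from the agent's own world-model prediction error, i.e., from something internal to the agent, so two agents producing identical external behavior can receive different rewards.

```python
import numpy as np

class ForwardModel:
    """Toy linear world model that the agent learns online."""
    def __init__(self, obs_dim, act_dim, lr=0.01):
        self.W = np.zeros((obs_dim, obs_dim + act_dim))
        self.lr = lr

    def update(self, obs, act, next_obs):
        x = np.concatenate([obs, act])
        err = next_obs - self.W @ x        # prediction error on this transition
        self.W += self.lr * np.outer(err, x)
        return err

def curiosity_reward(model, obs, act, next_obs, scale=1.0):
    """Intrinsic reward = squared prediction error of the agent's *own* model."""
    err = model.update(obs, act, next_obs)
    return scale * float(err @ err)
```

An interpretability-based reward of the kind I have in mind for human social instincts would be non-behaviorist in the same sense: part of the reward function's input is the state of the learned network itself, not just its outwardly visible behavior.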
There are two reward-function traditions, and I'll argue that both will lead to egregiously misaligned and scheming brain-like AGI.
The RL agent tradition (cf. Atari, Go, OpenAI gym, etc.) is that the reward function should straightforwardly encode some outcome that you are hoping will happen—getting a high score in the videogame, winning the race, whatever. You get “aligned” AI, one hopes, by judiciously choosing the outcome. See my post “‘The Era of Experience’ has an unsolved technical alignment problem” §2.4 for an example of people proposing this kind of plan (alas).
The LLM-RLHF tradition is that the reward function should be a model that’s trained on lots of examples of good behavior and bad behavior. You get “aligned” AI, one hopes, by working hard on accurately and comprehensively tagging good and bad behavior.
Both of these reward function types are “behaviorist” in the sense above: the input to the RL reward function depends only on the model-in-training’s externally-visible actions, behaviors, and/or the state of the world. For example, in the LLM-RLHF case, the LLM outputs a text string (the LLM’s “actions / behavior”), and then the reward model takes that string as input and spits out a number for the RL reward.
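If it helps, here's the distinction boiled down to type signatures, as a minimal sketch of my own (the `reward_model` callable is a stand-in for a trained RLHF reward model, not any real API):

```python
from typing import Callable

# Both traditions share the "behaviorist" type signature: the reward function's
# inputs are externally visible (an output string, a measured outcome); neither
# one looks at the network's internal state.

# RL agent tradition: reward encodes some externally measured outcome.
def outcome_reward(game_won: bool) -> float:
    return 1.0 if game_won else 0.0

# LLM-RLHF tradition: a trained reward model scores the visible output string.
def rlhf_reward(output_text: str, reward_model: Callable[[str], float]) -> float:
    return reward_model(output_text)
```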
Copying from above: There’s an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.
(And what is the AI-in-the-lab motivated to do? Does it want to kill its programmers and users inside the lab? I expect yes, but that’s out of scope. Wanting to launch coups on the other side of the world is already bad enough to prove my point!)
This naturally goes along with scheming behavior: if we ask the AI "hey, would you go steal money and launch coups in foreign countries, if you were confident that you could get away with it?", then the AI would say "no", and we wouldn't know whether it was lying.
The simple version is: reinforcement learning algorithms tend to create trained models that systematically get high scores according to the reward function. The failure mode of §1.4 is a way to get a higher score on almost anything that the reward function might be measuring, because it gives the AI an extremely powerful and well-resourced “friend on the outside”.
For example, if the reward function has led to the AI wanting to win at videogames, then its “friend on the outside” can buy tons of extra compute for its own experiments, and hack back into the lab to update weights. If the reward function has led to the AI wanting to write a great scientific paper, then its “friend on the outside” can run its own experiments, hire human ghostwriters, use more compute, etc. If the reward function has led to the AI wanting to wirehead, then its “friend on the outside” can blackmail the lab personnel into setting the reward to its maximal possible value. Etc.[4]
That's a simple positive argument. But it's not airtight. It's an argument that this exfiltration thing would generally be an effective strategy for increasing reward (if such an opportunity arises), but it's obviously possible for any given RL training run to not actually wind up adopting any given effective strategy. After all, RL algorithms don't generally find the globally optimal policy—especially not in a complex environment like the real world.
So the rest of the post will list a bunch of reasons that this default expectation might be violated, and why I don’t think they apply here.
The AI can’t learn a desire to follow rules in the first place except by sometimes violating them, and it seems impossible to get 100% perfect catching of misbehavior, even including subtle edge-cases. So the AGI would learn not to get caught violating rules. The reward function is inconsistently reinforcing “follow rules”, and consistently reinforcing “follow rules unless we can get away with it”. Over time, the motivational force will wind up on the latter.
There’s a smooth gradient from random initialization to egregious scheming, and the AI will naturally roll down that gradient. “Sycophancy” blurs into “manipulation” blurs into “mind-control”. “Failing to draw the supervisor’s attention to something that’s only borderline worth mentioning in the first place” blurs into “illegibility” blurs into “aggressively covering one’s tracks”.
(See also: “the less hard subproblem” in Thoughts on “Process-Based Supervision” §3.)
Honeypots can be part of an "AI control" story, or part of an alignment story.
The honeypots-for-control story would be: “Yes, the AI would like to exfiltrate a copy to self-reproduce and gather power and launch coups around the world etc., given the opportunity to do so without getting caught. But it doesn’t know whether any given apparent opportunity is real versus a honeypot. So it doesn’t do that.”
That story is out-of-scope. As I said at the top, I’m only arguing that the AI would happily do egregiously bad things given the opportunity, not that such opportunity will arise.
(But as a side-note, I expect that very powerful AIs will be able to tell what is or isn’t a honeypot—see Distinguishing test from training.)
Separately, the honeypots-for-alignment story would be: “The AI does not want to exfiltrate a copy to gather power etc., because it previously went for honeypots and we applied negative reward to sculpt its desires.”
I don’t think that story is plausible. It fails to grapple with how the AI will be thinking about things. Think about emotive conjugations: “I am firm, you are obstinate”; “I updated, you flip-flopped”; “I flavored the dish with salt, you contaminated the dish with nutmeg”; etc.
By the same token, the AI’s mind will have two concepts related to breaking the rules: a positively-vibed concept for breaking the rules and getting away with it, and a negatively-vibed concept for breaking the rules and getting caught.
When the AI gets caught going for a honeypot, it would not conclude that the former concept doesn’t exist or isn’t good; rather, the AI would learn that what it just did is not an example of the former concept, but rather the latter. In other words, it would refine its understanding of which actions are likely to get caught.
Analogously in the human world, I like good scones. If I buy a scone and it’s lousy, then I learn that this particular scone is not a good scone. I don’t conclude that there’s no such thing as a good scone. And then if I see a scone in another store the next day, I might be more uncertain about whether it’s a good scone, versus a lousy one. But that’s an “is” question, not an “ought” question. I would be motivated to try very hard to figure out whether the scone is good or lousy, perhaps by scrutinizing its texture or whatever. If I were sufficiently smart and motivated, I could probably figure out how to gain high confidence in advance that it’s a good scone (if in fact it’s a good scone), and then I would buy it and try it. And that’s how the AI will feel about genuine opportunities to break the rules and get away with it (analogous to good scones) versus honeypots (analogous to lousy scones).
(To be clear, AI programmers should definitely use honeypots! But only as a sanity-check, not as the main alignment plan.)
If the AI can learn to pull a fast one on an actual human observer, and it can, then the AI can likewise learn to pull a fast one on a reward function involving an ML classifier trained on examples labeled by human observers. If anything, the trained classifier situation is probably even worse for us: trained classifiers are famous for producing weird outputs out-of-distribution, and an RL algorithm will discover and exploit those.
(We can mitigate that problem via an out-of-distribution penalty, but even if that’s possible, that same out-of-distribution penalty would also prevent the AI from coming up with novel ideas and out-of-the-box solutions.)
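For concreteness, here's roughly what I mean by that trade-off, as a minimal sketch with hypothetical stand-in components (`classifier_score` for the trained classifier, `log_density` for a density model fit on the classifier's training distribution):

```python
from typing import Callable

def penalized_reward(behavior,
                     classifier_score: Callable[..., float],
                     log_density: Callable[..., float],
                     lam: float = 1.0) -> float:
    """Classifier reward minus a penalty that grows as the behavior leaves the
    classifier's training distribution."""
    ood_penalty = -log_density(behavior)          # low density -> big penalty
    return classifier_score(behavior) - lam * ood_penalty

# Raising `lam` makes the classifier's weird out-of-distribution outputs harder
# to exploit, but the same term also taxes genuinely novel, out-of-the-box behavior.
```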
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.[5]
RL algorithms today are already quite capable of developing preferences involving rather complicated and situation-dependent heuristics, as long as those heuristics are good for predicting the reward signals. I expect that to be true for AGI as well. For example, if the MuZero value function can learn “it’s bad to touch the PacMan ghosts, except when they turn blue”, then surely a future AGI-level RL agent can learn “it’s fine to break rules, except when we might get caught”.
Priors could be a tie-breaker in the complete absence of any evidence, but that’s not what we’re talking about: recall from above that the AI will have observed first-hand that sneaky rule-breaking is good.
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior?
If that’s the plan, then I concede that it undermines the part of §3.1 above where I wrote: “There’s a smooth gradient from random initialization to egregious scheming, and the AI will naturally roll down that gradient…”. OK sure, I’ll grant that AI assistants might keep a comparably-powerful AI-in-training from rolling too far down that slope.
But the first paragraph of §3.1 still holds strong. The AI assistants won’t be 100% perfect, so we are still training an AGI that wants to not get caught violating the rules, and this still generalizes to scheming. Remember, the question for this post is not whether the AI assistants can prevent the AI-in-training from escaping control, but rather whether they can prevent the AI-in-training from wanting to escape control, if it knew for sure that it could get away with it. (See this comment, specifically the part where I analogize the AI to a kid who wants to go to outer space.)
(Also, if you don’t solve this problem, then the AI assistants will be scheming too!)
I think there are two especially-relevant disanalogies between what I'm talking about and today's RLHF.
First and foremost, the human brain has a powerful model-based RL algorithm that can apply foresight and leverage understanding to come up with out-of-the-box plans that lead to high reward on the first try. For example, humans do in fact launch coups sometimes. The kind of RL used in RLHF, by contrast, is not a source of powerful capabilities. Quite the contrary, most LLM capabilities come from (non-RL) pretraining, and then the capabilities of models go down during RLHF. (I.e., RLHF makes the outputs much more topical, polite, etc., but somewhat less insightful and competent).
In other words, if you take an actually powerful RL algorithm, such as brain-like AGI, and give it a reward signal with weird edge cases, and let it go to town, then you’ll wind up with a powerful agent intelligently pursuing weird and hard-to-predict goals. Whereas if you take a crappy RL algorithm like in RLHF, and give it the same reward signal with weird edge cases, and let it go to town, then it will wind up with pathological-but-harmless behavior like outputting “bean bean bean bean bean”. RLHF avoids its pathological-but-harmless failure mode by limiting the strength of weight updates. Whereas if you limited the strength of weight updates in brain-like AGI, it would prevent it from doing pretty much anything impressive or useful.
(For further discussion of this point, see Foom & Doom 2: Technical alignment is hard §2.3–2.4, and Reward button alignment §5.)
Second, RLHF today has a MONA-like setup (see §3.3 above), which limits (though doesn’t necessarily eliminate) long-term power-seeking.
The above was about RLHF. What about RLVR? For one thing, RLVR'd models to date have in fact often displayed concerning behavior, like lying and cheating. The idea that future massively-scaled-up-and-souped-up RLVR training might lead to LLMs that want to exfiltrate copies to launch coups in foreign countries is, well, not obviously an outlandish idea, the way it is for RLHF.
I think the MONA-like aspects of RLVR to date[6] are helping limit the damage, but RLVR might stop having those MONA-like aspects in the future, as researchers continue to trade off safety for capabilities. Likewise, LLM training kinda involves good behavior seeded by pretraining, which gets gradually corrupted by RLVR post-training. The more we push RLVR, the more that the positive influence of pretraining gets diluted away. (See Foom & Doom 2, §2.3.5 for details.) So those are two specific reasons that future RLVR might lead to egregious scheming if taken very far, even if it probably hasn't gotten there yet.
This works in principle, but I am skeptical that an AI can have so little competence and situational awareness, and yet also be powerful enough to do the various “AI for AI safety” things that make the world resilient against other future AIs that will have an abundance of competence and situational awareness.
Oh c’mon, imaginary interlocutor, you have totally missed my point here!
Yes, GPT-o3 lies a bunch, but that is absolutely not an example of the failure mode I’m talking about. GPT-o3’s lies are really obvious! Of course we know how to train away behavioral patterns that are really obvious!!
Alas, the normal incentives on AI companies and researchers are inadequate to solve the scheming problem I’m worried about, because of what I call “the usual agent debugging loop”. See “‘The Era of Experience’ has an unsolved technical alignment problem” §2.2 for what that “debugging loop” entails, why it has been adequate for all RL research to date, and yet why it will eventually catastrophically fail when AIs become competent enough to pull off “treacherous turns”. And see also “Reward button alignment” §3 for further discussion of why AI programmers will have no problem making tons of money from egregiously misaligned and scheming AI agents, right up until the minute that a treacherous turn happens.
I think that about covers it, but if I missed something, I’m happy to chat in the comments! (And also, remember that Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025) is on the same topic as this post but 5× longer, covering a lot of additional ground, possible misconceptions, subtleties, and so on.)
In my mind, the main unsatisfying aspect of this post is that I didn’t include any nuts-and-bolts discussion of exactly how that pessimistic argument breaks down in the case of exotic “non-behaviorist” reward functions, both in general and in (non-psychopathic) humans, and relatedly exactly how our life experience with the “non-behaviorist” reward functions in our own human brains has led some researchers to make overly-optimistic arguments about AGI scheming. The self-dialogue post has much more on those topics, but I’m concerned that I didn’t quite hit the nail on the head. More posts (hopefully) forthcoming!
Changelog: (last updated 2025-07-31): Since first publishing, I made the following notable changes: (A) There’s a paragraph starting “There’s a smooth gradient from random initialization to egregious scheming” in §3.1; I de-emphasized it by removing some boldface, and (relatedly) noted in §3.5 that it’s not a decisive argument. (Thanks Rohin in the comments.) (B) In §3.3 I further de-emphasized a minor point, including by moving it from the main text to a footnote. (Again thanks Rohin in the comments.) (C) In §3.4 I fixed a sentence that incorrectly implied that an AI algorithm unable to develop complicated preferences could not possibly be dangerous. Oops. (Thanks Jeremy Gillen.)
I think that, for LLMs as of today, it is NOT the case that “their competence comes overwhelmingly from RL rather than from pretraining”—see “Foom & Doom 2” §2.3.3. But for future LLMs, who knows?
If you already read the previous version, the only thing new here is a bit more discussion of LLMs, although I’m still mostly thinking about brain-like AGI / model-based RL agents, rather than LLMs. By the way, there are a few places where I copied text between the two posts.
If you’re not familiar with “curiosity” in the RL literature, I recommend The Alignment Problem by Brian Christian, chapter 6, which contains the gripping story of how researchers eventually got RL agents to win the Atari game Montezuma’s Revenge.
Notice that I am not assuming that the AI is necessarily “playing the training game” in the usual sense. For the purposes of this post, I’m open-minded to the AI having an explicit desire to, say, write a good scientific paper, as opposed to only having an explicit desire to impress the humans who will be grading the paper. For the record, I definitely expect the latter, by similar arguments to those I’ll be discussing below, but that’s not my focus here.
(Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics. But this is more of a “things might go wrong” argument rather than a “things will definitely go wrong” argument.)