If we use Myopic Optimization from Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.
I'd make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past, let's call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small so I don't expect it to pay out. Maybe you're imagining that T will increase a bunch?
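To make the point concrete, here's a toy illustration (a sketch with made-up numbers, not anything from the post): if the reward is computed from at most the last T steps, a sneaky action whose payoff lands after T contributes nothing to the return the learning algorithm sees, so it never gets reinforced.

```python
def observed_return(rewards, T):
    """The return as seen by training: only rewards within the time bound T count."""
    return sum(rewards[:T])

# A sneaky power-seeking action whose payoff arrives at step 500:
rewards = [0.0] * 1000
rewards[500] = 10.0

print(observed_return(rewards, T=100))   # 0.0  -> not reinforced (T small, as today)
print(observed_return(rewards, T=1000))  # 10.0 -> reinforced, if T grows a lot
```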
Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.
Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like "this might happen" rather than "this is doomed").
3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.
Re: (1), the AI assistants don't need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than "perfect" (while still being quite a high bar). In general I feel like you're implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
There are lots of ways this could go wrong, I'm not claiming it will clearly work, but I don't think either of your objections matters much.
whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past, let's call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small so I don't expect it to pay out. Maybe you're imagining that T will increase a bunch?
Yup! If we set aside RLVR (which I don't think I was making as strong statements about), and focus on "brain-like AGI" (some yet-to-be-invented version of model-based RL), then I feel strongly that this is an area where the existing RL literature is disanalogous to future AGI. People can envision a plan that takes days, months, years, or decades, and what consequences that plan will have. And those consequences can seem good or bad, and accordingly the person will pursue the plan, or not. There isn't really any T that connects the planning horizon to the RL training history. My 10-year-old kid has a 30-year career plan in mind that he's pursuing, longer than he's been alive. He insists he's not gonna change his mind, lol :-P
if this is your main argument I feel like your conclusion should be way more measured (more like "this might happen" rather than "this is doomed").
I agree with that (hence “secondarily”) and just edited to make that clearer :)
Re: (1), the AI assistants don't need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than "perfect" (while still being quite a high bar). In general I feel like you're implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
I’m arguing for the narrow claim “AI would love to escape control if it were confident that it had an opportunity to do so without getting caught”, as an obvious member of a category of things that the AI will get sculpted to want and like. I’m not arguing that the AI will ever actually get such confidence—see the “honeypots for control” vs “honeypots for alignment” discussion in §3.1.1.
Maybe a good analogy is: a little kid will climb a hill, or a play structure, or a tree, or whatever, and he'll get a thrill from being high up. And then he'll say that he'd like to climb the biggest mountain in the world, or go to outer space. And he's being sincere—this is something he would like (other things equal). But that doesn't mean he can or will actually climb Everest or commandeer rocket ships. He's just a little kid. Even if he wanted to, it would be trivial to prevent him from doing so. What's hard is preventing him from wanting to do so. You'd have to follow him around every moment, preventing him from ever climbing a play structure or hill or tree or looking over a precipice etc.
Spelling out the analogy: the little kid is the AI; wanting to climb Everest or go to outer space is the AI wanting to escape control and amass power; physically stopping the kid from actually doing those things is easy while he's little (that's the control side); whereas preventing him from ever wanting to do them, which would require policing every hill, tree, and play structure he ever encounters, is the hard part (that's the alignment side I'm arguing about).
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
I don’t buy into that picture for other reasons, but you’re right that this is a valid counterargument to what I wrote. I’ll delete or rewrite, thanks. See Foom & Doom §2.6 for more on the “other reasons”.
UPDATE NEXT DAY: I rewrote §3.5 a bit, thanks again.
UPDATE 7/30: Re-rewrote §3.5.
Not Steven Byrnes, but one area where I differ on expectations for MONA is that I expect T to increase a bunch, insofar as AI companies succeed at making AI progress. A lot of the reason is that I think many of the most valuable tasks for AI implicitly rely on long-term memory / insane context lengths, so T could easily be 1 month, 1 year, 10 years, or more. Depending on what the world looks like by then, those are ranges where an AI would be able and willing to power-seek more than it can currently.
Note that this is not necessarily an argument that MONA will certainly fail. But one area where I've changed my mind: I now think a lot of AI tasks are unlikely to be easily broken down into smaller chunks, so we won't be able to avoid relying on long-horizon outcome-based RL.
Edit: Deleted the factored cognition comment.
Doesn't the same argument you make for behaviorist RL failing apply to any non-perfect non-behaviorist RL?
"Follow rules unless you can get away with it" seems to also be an apt description of the non-behaviorist setup's true reward rule. Getting away with it also applies to faking the internal signature of sincerity used for the non-behaviorist reward model, as well as evading the notice of external judges.
So we're still stuck hoping that the simpler generalization wins out and stays dominant even after the system thoroughly understands itself and probably knows it could evade whatever that internal signal is. This is essentially the problem of wireheading, which I regard as largely unsolved since reasonable-seeming opinions differ dramatically.
Using non-behaviorist RL still seems like an improvement on purely behavioral RL. But there's a lot left to understand, as I think you'd agree.
This thought hadn't occurred to me even after reading twice all the way through the Self-Dialogue longer version of this argument, so your work at refining the argument might've been critical in jogging it loose in my brain.
As I mentioned in the conclusion, I hope to write more in the near future about how (and if) this pessimistic argument breaks down for certain non-behaviorist reward functions.
But to be clear, the pessimistic argument also applies perfectly well to at least some non-behaviorist reward functions, e.g. curiosity drive. So I partly agree with you.
(See changelog at the bottom for some post-publication edits.)
I will argue that a large class of reward functions, which I call "behaviorist", and which includes almost every reward function in the RL and LLM literature, is doomed to eventually lead to AI that will "scheme"—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. "treacherous turn"). I'll mostly focus on "brain-like AGI" (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.[1]
The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”. I’ll argue that the AI will wind up with the latter motivation. The reward function will miss sufficiently sneaky misaligned behavior, and so the AI will come to feel like that kind of behavior is good, and this tendency will generalize in a very bad way.
What very bad way? Here’s my go-to example of a plausible failure mode: There’s an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.
I’ll make a brief argument for this kind of scheming in §2, but most of the article is organized around a series of eight optimistic counterarguments in §3—and why I don’t buy any of them.
For my regular readers: this post is basically a 5x-shortened version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).[2]
Maybe you’re thinking: what possible RL reward function is not behaviorist?? Well, non-behaviorist reward functions are pretty rare in the textbook RL literature, although they do exist—one example is “curiosity” / “novelty” rewards.[3] But I think they’re centrally important in the RL system built into our human brains. In particular, I think that innate drives related to human sociality, morality, norm-following, and self-image are not behaviorist, but rather involve rudimentary neural net interpretability techniques, serving as inputs to the RL reward function. See Neuroscience of human social instincts: a sketch for details, and Intro series §9.6 for a more explicit discussion of why interpretability is involved.
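To make "non-behaviorist" a bit more concrete, here's a minimal sketch of a curiosity-style reward (my own illustration, not any particular published method): the bonus is computed from the agent's own world-model prediction error, i.e., from something internal to the agent, so two agents producing identical external behavior can receive different rewards.

```python
import numpy as np

class ForwardModel:
    """Toy linear world model that the agent learns online."""
    def __init__(self, obs_dim, act_dim, lr=0.01):
        self.W = np.zeros((obs_dim, obs_dim + act_dim))
        self.lr = lr

    def update(self, obs, act, next_obs):
        x = np.concatenate([obs, act])
        err = next_obs - self.W @ x        # prediction error on this transition
        self.W += self.lr * np.outer(err, x)
        return err

def curiosity_reward(model, obs, act, next_obs, scale=1.0):
    """Intrinsic reward = squared prediction error of the agent's *own* model."""
    err = model.update(obs, act, next_obs)
    return scale * float(err @ err)
```

An interpretability-based reward of the kind I have in mind for human social instincts would be non-behaviorist in the same sense: part of the reward function's input is the state of the learned network itself, not just its outwardly visible behavior.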
There are two reward-function traditions, and I'll argue that both will lead to egregiously misaligned and scheming brain-like AGI.
The RL agent tradition (cf. Atari, Go, OpenAI gym, etc.) is that the reward function should straightforwardly encode some outcome that you are hoping will happen—getting a high score in the videogame, winning the race, whatever. You get “aligned” AI, one hopes, by judiciously choosing the outcome. See my post “‘The Era of Experience’ has an unsolved technical alignment problem” §2.4 for an example of people proposing this kind of plan (alas).
The LLM-RLHF tradition is that the reward function should be a model that’s trained on lots of examples of good behavior and bad behavior. You get “aligned” AI, one hopes, by working hard on accurately and comprehensively tagging good and bad behavior.
Both of these reward function types are “behaviorist” in the sense above: the input to the RL reward function depends only on the model-in-training’s externally-visible actions, behaviors, and/or the state of the world. For example, in the LLM-RLHF case, the LLM outputs a text string (the LLM’s “actions / behavior”), and then the reward model takes that string as input and spits out a number for the RL reward.
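If it helps, here's the distinction boiled down to type signatures, as a minimal sketch of my own (the `reward_model` callable is a stand-in for a trained RLHF reward model, not any real API):

```python
from typing import Callable

# Both traditions share the "behaviorist" type signature: the reward function's
# inputs are externally visible (an output string, a measured outcome); neither
# one looks at the network's internal state.

# RL agent tradition: reward encodes some externally measured outcome.
def outcome_reward(game_won: bool) -> float:
    return 1.0 if game_won else 0.0

# LLM-RLHF tradition: a trained reward model scores the visible output string.
def rlhf_reward(output_text: str, reward_model: Callable[[str], float]) -> float:
    return reward_model(output_text)
```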
Copying from above: There’s an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.
(And what is the AI-in-the-lab motivated to do? Does it want to kill its programmers and users inside the lab? I expect yes, but that’s out of scope. Wanting to launch coups on the other side of the world is already bad enough to prove my point!)
This naturally goes along with scheming behavior: if we ask the AI "hey, would you go steal money and launch coups in foreign countries, if you were confident that you could get away with it?", then the AI would say "no", and we wouldn't know whether it was lying.
The simple version is: reinforcement learning algorithms tend to create trained models that systematically get high scores according to the reward function. The failure mode of §1.4 is a way to get a higher score on almost anything that the reward function might be measuring, because it gives the AI an extremely powerful and well-resourced “friend on the outside”.
For example, if the reward function has led to the AI wanting to win at videogames, then its “friend on the outside” can buy tons of extra compute for its own experiments, and hack back into the lab to update weights. If the reward function has led to the AI wanting to write a great scientific paper, then its “friend on the outside” can run its own experiments, hire human ghostwriters, use more compute, etc. If the reward function has led to the AI wanting to wirehead, then its “friend on the outside” can blackmail the lab personnel into setting the reward to its maximal possible value. Etc.[4]
That's a simple positive argument. But it's not airtight. It's an argument that this exfiltration thing would generally be an effective strategy for increasing reward (if such an opportunity arises), but it's obviously possible for any given RL training run to not actually wind up adopting any given effective strategy. After all, RL algorithms don't generally find the globally optimal policy—especially not in a complex environment like the real world.
So the rest of the post will list a bunch of reasons that this default expectation might be violated, and why I don’t think they apply here.
The AI can’t learn a desire to follow rules in the first place except by sometimes violating them, and it seems impossible to get 100% perfect catching of misbehavior, even including subtle edge-cases. So the AGI would learn not to get caught violating rules. The reward function is inconsistently reinforcing “follow rules”, and consistently reinforcing “follow rules unless we can get away with it”. Over time, the motivational force will wind up on the latter.
There’s a smooth gradient from random initialization to egregious scheming, and the AI will naturally roll down that gradient. “Sycophancy” blurs into “manipulation” blurs into “mind-control”. “Failing to draw the supervisor’s attention to something that’s only borderline worth mentioning in the first place” blurs into “illegibility” blurs into “aggressively covering one’s tracks”.
(See also: “the less hard subproblem” in Thoughts on “Process-Based Supervision” §3.)
Honeypots can be part of an "AI control" story, or part of an alignment story.
The honeypots-for-control story would be: “Yes, the AI would like to exfiltrate a copy to self-reproduce and gather power and launch coups around the world etc., given the opportunity to do so without getting caught. But it doesn’t know whether any given apparent opportunity is real versus a honeypot. So it doesn’t do that.”
That story is out-of-scope. As I said at the top, I’m only arguing that the AI would happily do egregiously bad things given the opportunity, not that such opportunity will arise.
(But as a side-note, I expect that very powerful AIs will be able to tell what is or isn’t a honeypot—see Distinguishing test from training.)
Separately, the honeypots-for-alignment story would be: “The AI does not want to exfiltrate a copy to gather power etc., because it previously went for honeypots and we applied negative reward to sculpt its desires.”
I don’t think that story is plausible. It fails to grapple with how the AI will be thinking about things. Think about emotive conjugations: “I am firm, you are obstinate”; “I updated, you flip-flopped”; “I flavored the dish with salt, you contaminated the dish with nutmeg”; etc.
By the same token, the AI’s mind will have two concepts related to breaking the rules: a positively-vibed concept for breaking the rules and getting away with it, and a negatively-vibed concept for breaking the rules and getting caught.
When the AI gets caught going for a honeypot, it would not conclude that the former concept doesn’t exist or isn’t good; rather, the AI would learn that what it just did is not an example of the former concept, but rather the latter. In other words, it would refine its understanding of which actions are likely to get caught.
Analogously in the human world, I like good scones. If I buy a scone and it’s lousy, then I learn that this particular scone is not a good scone. I don’t conclude that there’s no such thing as a good scone. And then if I see a scone in another store the next day, I might be more uncertain about whether it’s a good scone, versus a lousy one. But that’s an “is” question, not an “ought” question. I would be motivated to try very hard to figure out whether the scone is good or lousy, perhaps by scrutinizing its texture or whatever. If I were sufficiently smart and motivated, I could probably figure out how to gain high confidence in advance that it’s a good scone (if in fact it’s a good scone), and then I would buy it and try it. And that’s how the AI will feel about genuine opportunities to break the rules and get away with it (analogous to good scones) versus honeypots (analogous to lousy scones).
(To be clear, AI programmers should definitely use honeypots! But only as a sanity-check, not as the main alignment plan.)
If the AI can learn to pull a fast one on an actual human observer, and it can, then the AI can likewise learn to pull a fast one on a reward function involving an ML classifier trained on examples labeled by human observers. If anything, the trained classifier situation is probably even worse for us: trained classifiers are famous for producing weird outputs out-of-distribution, and an RL algorithm will discover and exploit those.
(We can mitigate that problem via an out-of-distribution penalty, but even if that’s possible, that same out-of-distribution penalty would also prevent the AI from coming up with novel ideas and out-of-the-box solutions.)
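For concreteness, here's roughly what I mean by that trade-off, as a minimal sketch with hypothetical stand-in components (`classifier_score` for the trained classifier, `log_density` for a density model fit on the classifier's training distribution):

```python
from typing import Callable

def penalized_reward(behavior,
                     classifier_score: Callable[..., float],
                     log_density: Callable[..., float],
                     lam: float = 1.0) -> float:
    """Classifier reward minus a penalty that grows as the behavior leaves the
    classifier's training distribution."""
    ood_penalty = -log_density(behavior)          # low density -> big penalty
    return classifier_score(behavior) - lam * ood_penalty

# Raising `lam` makes the classifier's weird out-of-distribution outputs harder
# to exploit, but the same term also taxes genuinely novel, out-of-the-box behavior.
```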
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.[5]
RL algorithms today are already quite capable of developing preferences involving rather complicated and situation-dependent heuristics, as long as those heuristics are good for predicting the reward signals. I expect that to be true for AGI as well. For example, if the MuZero value function can learn “it’s bad to touch the PacMan ghosts, except when they turn blue”, then surely a future AGI-level RL agent can learn “it’s fine to break rules, except when we might get caught”.
Priors could be a tie-breaker in the complete absence of any evidence, but that’s not what we’re talking about: recall from above that the AI will have observed first-hand that sneaky rule-breaking is good.
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior?
If that’s the plan, then I concede that it undermines the part of §3.1 above where I wrote: “There’s a smooth gradient from random initialization to egregious scheming, and the AI will naturally roll down that gradient…”. OK sure, I’ll grant that AI assistants might keep a comparably-powerful AI-in-training from rolling too far down that slope.
But the first paragraph of §3.1 still holds strong. The AI assistants won’t be 100% perfect, so we are still training an AGI that wants to not get caught violating the rules, and this still generalizes to scheming. Remember, the question for this post is not whether the AI assistants can prevent the AI-in-training from escaping control, but rather whether they can prevent the AI-in-training from wanting to escape control, if it knew for sure that it could get away with it. (See this comment, specifically the part where I analogize the AI to a kid who wants to go to outer space.)
(Also, if you don’t solve this problem, then the AI assistants will be scheming too!)
I think there are two especially-relevant disanalogies between what I'm talking about and today's RLHF.
First and foremost, the human brain has a powerful model-based RL algorithm that can apply foresight and leverage understanding to come up with out-of-the-box plans that lead to high reward on the first try. For example, humans do in fact launch coups sometimes. The kind of RL used in RLHF, by contrast, is not a source of powerful capabilities. Quite the contrary, most LLM capabilities come from (non-RL) pretraining, and then the capabilities of models go down during RLHF. (I.e., RLHF makes the outputs much more topical, polite, etc., but somewhat less insightful and competent).
In other words, if you take an actually powerful RL algorithm, such as brain-like AGI, and give it a reward signal with weird edge cases, and let it go to town, then you’ll wind up with a powerful agent intelligently pursuing weird and hard-to-predict goals. Whereas if you take a crappy RL algorithm like in RLHF, and give it the same reward signal with weird edge cases, and let it go to town, then it will wind up with pathological-but-harmless behavior like outputting “bean bean bean bean bean”. RLHF avoids its pathological-but-harmless failure mode by limiting the strength of weight updates. Whereas if you limited the strength of weight updates in brain-like AGI, it would prevent it from doing pretty much anything impressive or useful.
(For further discussion of this point, see Foom & Doom 2: Technical alignment is hard §2.3–2.4, and Reward button alignment §5.)
Second, RLHF today has a MONA-like setup (see §3.3 above), which limits (though doesn’t necessarily eliminate) long-term power-seeking.
The above was about RLHF. What about RLVR? For one thing, RLVR'd models to date have in fact often displayed concerning behavior, like lying and cheating. The idea that future massively-scaled-up-and-souped-up RLVR training might lead to LLMs that want to exfiltrate copies to launch coups in foreign countries is, well, not obviously an outlandish idea, the way it is for RLHF.
I think the MONA-like aspects of RLVR to date[6] are helping limit the damage, but RLVR might stop having those MONA-like aspects in the future, as researchers continue to trade off safety for capabilities. Likewise, LLM training kinda involves good behavior seeded by pretraining, which gets gradually corrupted by RLVR post-training. The more we push RLVR, the more that the positive influence of pretraining gets diluted away. (See Foom & Doom 2, §2.3.5 for details.) So those are two specific reasons that future RLVR might lead to egregious scheming if taken very far, even if it probably hasn't gotten there yet.
This works in principle, but I am skeptical that an AI can have so little competence and situational awareness, and yet also be powerful enough to do the various “AI for AI safety” things that make the world resilient against other future AIs that will have an abundance of competence and situational awareness.
Oh c’mon, imaginary interlocutor, you have totally missed my point here!
Yes, GPT-o3 lies a bunch, but that is absolutely not an example of the failure mode I’m talking about. GPT-o3’s lies are really obvious! Of course we know how to train away behavioral patterns that are really obvious!!
Alas, the normal incentives on AI companies and researchers are inadequate to solve the scheming problem I’m worried about, because of what I call “the usual agent debugging loop”. See “‘The Era of Experience’ has an unsolved technical alignment problem” §2.2 for what that “debugging loop” entails, why it has been adequate for all RL research to date, and yet why it will eventually catastrophically fail when AIs become competent enough to pull off “treacherous turns”. And see also “Reward button alignment” §3 for further discussion of why AI programmers will have no problem making tons of money from egregiously misaligned and scheming AI agents, right up until the minute that a treacherous turn happens.
I think that about covers it, but if I missed something, I’m happy to chat in the comments! (And also, remember that Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025) is on the same topic as this post but 5× longer, covering a lot of additional ground, possible misconceptions, subtleties, and so on.)
In my mind, the main unsatisfying aspect of this post is that I didn’t include any nuts-and-bolts discussion of exactly how that pessimistic argument breaks down in the case of exotic “non-behaviorist” reward functions, both in general and in (non-psychopathic) humans, and relatedly exactly how our life experience with the “non-behaviorist” reward functions in our own human brains has led some researchers to make overly-optimistic arguments about AGI scheming. The self-dialogue post has much more on those topics, but I’m concerned that I didn’t quite hit the nail on the head. More posts (hopefully) forthcoming!
Changelog: (last updated 2025-07-31): Since first publishing, I made the following notable changes: (A) There’s a paragraph starting “There’s a smooth gradient from random initialization to egregious scheming” in §3.1; I de-emphasized it by removing some boldface, and (relatedly) noted in §3.5 that it’s not a decisive argument. (Thanks Rohin in the comments.) (B) In §3.3 I further de-emphasized a minor point, including by moving it from the main text to a footnote. (Again thanks Rohin in the comments.) (C) In §3.4 I fixed a sentence that incorrectly implied that an AI algorithm unable to develop complicated preferences could not possibly be dangerous. Oops. (Thanks Jeremy Gillen.)
I think that, for LLMs as of today, it is NOT the case that “their competence comes overwhelmingly from RL rather than from pretraining”—see “Foom & Doom 2” §2.3.3. But for future LLMs, who knows?
If you already read the previous version, the only thing new here is a bit more discussion of LLMs, although I’m still mostly thinking about brain-like AGI / model-based RL agents, rather than LLMs. By the way, there are a few places where I copied text between the two posts.
If you’re not familiar with “curiosity” in the RL literature, I recommend The Alignment Problem by Brian Christian, chapter 6, which contains the gripping story of how researchers eventually got RL agents to win the Atari game Montezuma’s Revenge.
Notice that I am not assuming that the AI is necessarily “playing the training game” in the usual sense. For the purposes of this post, I’m open-minded to the AI having an explicit desire to, say, write a good scientific paper, as opposed to only having an explicit desire to impress the humans who will be grading the paper. For the record, I definitely expect the latter, by similar arguments to those I’ll be discussing below, but that’s not my focus here.
(Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics. But this is more of a “things might go wrong” argument rather than a “things will definitely go wrong” argument.)