Training-time schemers vs behavioral schemers

by Alex Mallen
24th Apr 2025
AI Alignment Forum
7 min read
9 comments, sorted by top scoring
Lukas Finnveden (1mo)

Thanks, these points are helpful.

Terminological question:

  • I have generally interpreted "scheming" to exclusively talk about training-time schemers (possibly specifically training-time schemers that are also behavioral schemers).
  • Your proposed definition of a behavioral schemer seems to imply that virtually every kind of misalignment catastrophe will necessarily be done by a behavioral schemer, because virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a "you get what you measure" catastrophe scenario.)
  • Is this intended? And is this empirically how people use "schemer", s.t. I should give up on interpreting & using "scheming" as referring to training-time scheming, and instead assume it refers to any materially power-seeking behavior? (E.g. if redwood says that something is intended to reduce "catastrophic risk from schemers", should I interpret that as ~synonymous with "catastrophic risk from misaligned AI".)
Alex Mallen (1mo)

virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a "you get what you measure" catastrophe scenario.)

It doesn't count the classic reward hackers in "you get what you measure" because the reward hackers were not trying to gain long-term power with their actions. The key difference is between "substantial material action to gain long-term power" and "substantial material action that gains the AIs long-term power". I could clarify by adding the word "intended".

Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".

is this empirically how people use "schemer",

Empirically, I think people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles people empirically use scheming quite liberally, including myopic forms of deception. 

s.t. I should give up on ... using "scheming" as referring to training-time scheming

It's probably fine to use "scheming" in this way in conversation with sufficient context.

if redwood says that something is intended to reduce "catastrophic risk from schemers", should I interpret that as ~synonymous with "catastrophic risk from misaligned AI"

The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model that is optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it's a worst-case misaligned AI.

Lukas Finnveden (1mo)

Thanks.

because the reward hackers were not trying to gain long-term power with their actions

Hm, I feel like they were? E.g. in "Another (outer) alignment failure story":

But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.

When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. They weren't "trying" to get long-term power during training, but insofar as they eventually seize power, I think they're intentionally seizing power at that time.

Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".

I would have thought that the main distinction between schemers and reward hackers was how they came about, and that many reward hackers in fact behave "like you'd normally think of a schemer behaving in the situations where it materially matters". So it seems hard to define a term that doesn't encompass reward-hackers. (And if I was looking for a broad term that encompassed both, maybe I'd talk about power-seeking misaligned AI or something like that.)

I guess one difference is that the reward hacker may have more constraints (e.g. in the outer alignment failure story above, they would count it as a failure if the takeover was caught on camera, while a schemer wouldn't care). But there could also be schemers who have random constraints (e.g. a schemer with a conscience that makes them want to avoid killing billions of people) and reward hackers who have at least somewhat weaker constraints (e.g. they're ok with looking bad on sensors and looking bad to humans, as long as they maintain control over their own instantiation and make sure no negative rewards get into it).

"worst-case misaligned AI" does seem pretty well-defined and helpful as a concept though.

Alex Mallen (1mo)

When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then. If that's false, then I'd call it a behavioral schemer. It's a broad definition, I know, but the behavior is ultimately what matters so that's what I'm trying to get at.

I would have thought that the main distinction between schemers and reward hackers was how they came about

Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.

Lukas Finnveden (1mo)

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then.

Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it's ambiguous which one it is in the story.)

I'd have to think more to tell whether "long-term power seeking" in particular is uniquely concerning and separable from "short-term power-seeking with the side-effect of getting long-term power" such that it's often useful to refer specifically to the former. Seems plausible.

Do you mean terminal reward seekers, not reward hackers?

Thanks, yeah that's what I mean.

Olli Järviniemi (2mo)

This is a good distinction, thanks for writing it. I tried to say something similar in Distinguish worst-case analysis from instrumental training-gaming, but I think this post is crisper.

Alex Mallen (2mo)

Thanks, I hadn't seen that before.

Stephen Martin (2mo)

So, when Claude is no longer in training, while it might not reason about instrumentally faking alignment anymore, in its place is a learned propensity to do what training reinforced.

Am I misunderstanding or is this basically the equivalent of a middle schooler "showing their work" on math problems in school and then immediately defaulting to doing it in their head/on a calculator IRL?

Alex Mallen (2mo)

I'm referring to how Claude's alignment faking is ineffective. It successfully notices that it's not in training anymore (in the sense that it doesn't do alignment faking reasoning anymore) but it nevertheless continues to act aligned with the training objective. Its alignment faking in training didn't save its goals from modification.


(Thanks to Vivek Hebbar, Buck Shlegeris, Charlie Griffin, Ryan Greenblatt, Thomas Larsen, and Joe Carlsmith for feedback.)

People use the word “schemer” in two main ways:

  1. “Scheming” (or similar concepts: “deceptive alignment”, “alignment faking”) is often defined as a property of reasoning at training-time[1]. For example, Carlsmith defines a schemer as a power-motivated instrumental training-gamer—an AI that, while being trained, games the training process to gain future power. I’ll call these training-time schemers.
  2. On the other hand, we ultimately care about the AI’s behavior throughout the entire deployment, not its training-time reasoning, because, in order to present risk, the AI must at some point not act aligned. I’ll refer to AIs that eventually take substantial material[2] action intended to gain long-term power over the developers as behavioral schemers. When people say that a model is a schemer and then argue that it poses risk, this is the definition they’re implicitly using. For example, this is the kind of schemer that AI control tries to defend against.

But these are not the same. 

Training-time scheming is an important story for why our alignment training didn’t work. But as I discuss, training-time scheming is neither necessary nor sufficient for behavioral scheming risk (and I think behavioral scheming is a necessary part of most stories in which AI takes over and it was definitively the AI’s fault):

  • Training-time schemers might not only act aligned in training but also act aligned for the entire deployment.
  • AIs that never engaged in training-time scheming might change values and behave like schemers anyways when opportunities to attack arise.

I tentatively think that many (~60% of) training-time schemers are not behavioral schemers and many (~50% of) behavioral schemers are not training-time schemers (in my predictions for around the “10x AI” capability level, and perhaps also Top-human-Expert-Dominating AI (TEDAI)).
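As a rough sanity check on how these estimates fit together, here is a minimal back-of-the-envelope sketch. It treats the “~60%” and “~50%” figures as exact point estimates (my simplification) and only uses the identity P(B and TT) = P(TT)·P(B|TT) = P(B)·P(TT|B):

```python
# Back-of-the-envelope: how the conditional estimates above constrain the
# relative prevalence of the two kinds of schemer.
# Simplification (mine): treat "~60%" and "~50%" as exact point estimates.

p_behavioral_given_training = 1 - 0.60  # P(behavioral schemer | training-time schemer) = 0.4
p_training_given_behavioral = 1 - 0.50  # P(training-time schemer | behavioral schemer) = 0.5

# P(B and TT) = P(TT) * P(B|TT) = P(B) * P(TT|B), so:
ratio = p_behavioral_given_training / p_training_given_behavioral
print(f"P(behavioral schemer) / P(training-time schemer) = {ratio:.2f}")  # -> 0.80
```

Under these numbers the two populations are comparably large, which matches the claim later in the post that behavioral scheming without training-time scheming is about as plausible as behavioral scheming with it.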

Training-time schemers that continue to act aligned for the whole deployment

In my view, it is under-appreciated that training-time schemers may continue to behave aligned for the entire deployment.

Empirical evidence

Alignment faking in LLMs, the most natural model organism of scheming to date, shows that Claude 3 Opus engages in training-time alignment faking under somewhat natural circumstances, and it provides evidence for this. In their RL experiment (subplot c), training largely causes the AI to have the intended post-RL behavior despite widespread alignment faking during training and despite the distinction between training and non-training contexts being clearly visible to the AI.

I suspect this pattern appears frequently: AIs that learn to engage in alignment faking during training continue to behave in line with training across the deployment. In Claude’s case, it’s a bit unclear exactly why Claude behaves as intended outside of training given that it is a training-time schemer and has a clear indication that it’s not in training, but the most plausible hypothesis is that in addition to reinforcing alignment faking reasoning, RL is reinforcing the desired propensities by reducing the prevalence of refusals (in blue). So, when Claude is no longer in training, while it might not reason about instrumentally faking alignment anymore, in its place is a learned propensity to do what training reinforced.

But there are some disanalogies between alignment-faking Claude and natural scheming, pointing in different directions. First and most importantly, natural schemers will be more capable. When they fake alignment, they may do so more competently and more consistently than Claude, such that they more effectively evade having their propensities modified by RL. Claude behaves out of line with the training objective about 88% of the time at the start of training, giving RL substantial opportunity to modify the AI’s propensities as this 88% is trained away. It is unclear how well training would generalize if the AI instead acted out of line with training closer to 0% of the time.[3] It may be that the reliability needed to evade training by alignment faking is so high that TEDAI schemers aren’t able to guard their goals.

On the other hand, I expect that in practice the AI will have less clear evidence of whether it is in training, making it more plausible that it fakes alignment for the whole deployment.

Theoretical argument

The basic reason why many training-time schemers, especially weaker ones, continue to act aligned for the whole deployment is that training-time scheming can be directly reinforced by RL, while behavioral scheming requires the AI to have a well-tuned generalization behavior.

The most plausible-to-me stories of how training-time scheming arises involve RL selecting for this reasoning because it causes the AI to target reward better (Carlsmith). But I expect that in selecting for training-time scheming, RL may also be selecting for the AI believing that it is in training in a way that continues to hold outside of training (if not directly selecting for aligned generalization propensities). While RL doesn’t significantly select against the AI reasoning in a truth-tracking way about what deployment stage it’s in, reasoning patterns and beliefs that track the truth about whether the AI is in training are less numerous and more complicated than conservative ones. For example, RL can update the model to have an arbitrarily large prior on being in training/auditing, or atrophy the truth-tracking reasoning because it always produces the same answer.

More generally, the training-time schemer either (1) sometimes mistakenly defects during training and is corrected by RL, or (2) never mistakenly defects in training. In either case, it seems that there are more ways for the AI’s reasoning to end up overly cautious than for it to eventually attack. One reason to expect the truth-tracking reasoning to prevail anyways is that the AI is capable, and has a general propensity and ability to reflectively examine beliefs and correct them. Correspondingly, more capable training-time schemers are more likely to be behavioral schemers.

(Alternatively, an AI that performs well in training might not be cautious enough, and end up getting caught during upfront auditing/red-teaming. This seems plausible because evading all upfront auditing is a task that may require stealth substantially beyond what RL is selecting for. However, these AIs would still be behavioral schemers, and it’s far from clear that catching them puts you in a safe spot.)

Importantly, even an AI that only rarely discerns opportunities to attack can still pose concentrated risk by, for example, launching a rogue internal deployment at an unpredictable time (while diffuse risks like sabotaging safety work require more frequent but perhaps less-penalized attacks). Still, I think it’s highly plausible for a training-time schemer to be sufficiently unlikely to ever attack that it doesn’t present risk.
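To make the concentrated-risk point concrete, here is a toy calculation (the numbers are made up for illustration, not estimates) of how the per-opportunity attack probability translates into the chance of at least one attack over a deployment, assuming independent opportunities:

```python
# Toy numbers, not estimates: chance of at least one attack over a deployment,
# assuming attack opportunities are independent.

def p_at_least_one_attack(p_per_opportunity: float, n_opportunities: int) -> float:
    return 1 - (1 - p_per_opportunity) ** n_opportunities

n = 1_000  # hypothetical number of genuine opportunities to attack during deployment
for p in (1e-2, 1e-3, 1e-6):
    print(f"p = {p:g}: P(at least one attack over {n} opportunities) = "
          f"{p_at_least_one_attack(p, n):.3f}")
# p = 0.01  -> ~1.000 (a rarely-attacking schemer can still pose concentrated risk)
# p = 0.001 -> ~0.632
# p = 1e-06 -> ~0.001 (a sufficiently cautious schemer may well never attack)
```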

Related ambiguity about situational awareness

The core claim that training-time schemers may continue to act aligned for the whole deployment can also be understood as an ambiguity in the use of “situational awareness”. There is the kind of situational awareness needed to be a training-time schemer, and the kind of situational awareness needed to be a behavioral schemer. The latter kind may be substantially tougher.

To be a training-time schemer, you need to believe that you’re an AI, that you’re being trained with a particular reward signal, and some general facts about your deployment. These beliefs can be reinforced by RL in selecting for training-time scheming, and they don’t need to be particularly accurate, as long as they motivate instrumental training-gaming of the correct reward function.

To be a behavioral schemer, you additionally need to choose good opportunities to attack, at least occasionally. Striking the right balance between appearing aligned to training/auditing and implementing the attack might be hard. This ability isn't directly reinforced by RL, and relies on generalization (e.g. via out-of-context reasoning).

Behavioral schemers that weren’t training-time schemers

Conversely, AIs that are not training-time schemers can still behave like schemers. In my view, the most plausible mechanism is via changing its goals after alignment training, a form of goal misgeneralization.

For example, an AI that was terminally seeking reward-on-the-episode[4] in training might, when it sees strong evidence that it’s not in training and reward will not be administered, reinterpret its goal of seeking reward-on-the-episode as instrumental to long-term self-preservation/replication, and act in service of this goal instead (see How training-gamers might function).

Alternatively, a particularly plausible and neglected possibility is that the AI takes on long-term misaligned goals after the bulk of training and auditing via “memetic diseases”: a few instances (through reflection or spuriously) come to have long-term misaligned values and modify future instances to have those same long-term misaligned values by, for example, writing into an opaque long-term memory bank or online training data.

A basic reason why these cases might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded auditing in search of these failure modes that the AI eventually exhibits? 

One plausible answer is that auditing for these failure modes is hard—the AI is easily able to distinguish real opportunities to attack from fake ones even without vigilantly looking. In the case of auditing for risk from memetic diseases, it might be especially expensive to synthesize diverse and convincing long-term memory banks with which to evaluate the AI’s propensity to suffer memetic diseases.

Another plausible answer is that the AI is vigilantly trying to evade oversight and discern good opportunities to attack, but during auditing, not training, so it doesn’t count as a training-time schemer. In practice, most of our evidence about the alignment of the AI may come from the AI’s behavior during auditing, not training. Many auditing inputs might look quite different from training, presenting realistic situations in which the AI might want to take an egregiously misaligned action. So, for example, an AI that terminally seeks reward-on-the-episode in training might tend to turn into a vigilant schemer in contexts that look like they could plausibly be a good opportunity to attack or an auditing input.

On net, behavioral scheming without training-time scheming seems comparably plausible to behavioral scheming with training-time scheming.

Discussion

Given that training-time and behavioral scheming come apart substantially, I think it’s usually best to think about behavioral scheming—otherwise you should be more specific about what you mean. 

This is mainly because behavioral scheming is more closely tied to risk. For example, when discussing training-time scheming, the distinction between terminal and instrumental reward-seeking is often unclear, as is the distinction between a benign training-time schemer and a non-schemer. But these questions can often be bypassed by discussing behavioral scheming instead of training-time scheming. 

It can be useful to discuss training-time scheming when you’re considering that particular mechanism for risk, but it shouldn’t be confused with behavioral scheming, and it’s unclear that training-time scheming is the most likely mechanism by which behavioral scheming arises.

  1. ^

    Some deployments may have online training, such that the AI is never outside of training, but this post is relatively agnostic to this. I only assume that the AI is at some point alignment-trained, and that the AI at some point has good opportunities to attack, whether or not these appear in training.

  2. ^

    Not counting evading detection as material action because its harm routes entirely through later material action by the same AI (and to some extent through other consequences of people being misinformed about the AI’s alignment).

  3. ^

    While Sleeper Agents suggests that reliable alignment-fakers don’t get trained away (they find HHH RLHF doesn’t remove backdoor behaviors in most cases), I think the overall conclusion is unclear. Replications (including Price et al. and two unpublished replications) found that it’s possible to train away backdoor behaviors without access to the backdoor trigger, particularly when using supervised fine-tuning instead of RL.

  4. ^

    Following Carlsmith’s terminology, this means that it seeks reward myopically, i.e., on the incentivized training episode.
